CN111680513B - Feature information identification method and device and computer readable storage medium - Google Patents

Feature information identification method and device and computer readable storage medium

Info

Publication number
CN111680513B
Authority
CN
China
Prior art keywords
chinese character
industry
vector
character combination
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010482841.6A
Other languages
Chinese (zh)
Other versions
CN111680513A (en)
Inventor
王伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Bank Co Ltd
Original Assignee
Ping An Bank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Bank Co Ltd
Priority to CN202010482841.6A
Publication of CN111680513A
Application granted
Publication of CN111680513B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Abstract

The invention relates to artificial intelligence technology and discloses a feature information identification method, comprising: obtaining the text to be recognized of an organization name, and calculating the average word vector of each Chinese character combination corresponding to the pinyin of the text to be recognized; obtaining the target industry category to which the organization name belongs and the target cluster to which the target industry category belongs; calculating a first similarity between the average word vector of each Chinese character combination and the average industry vector of the target industry category; calculating a second similarity between the average word vector of each Chinese character combination and the average cluster vector of the target cluster; performing a weighted calculation on the first similarities and the second similarities; and determining the Chinese character combination with the highest weighted score as the Chinese character combination of the organization name. The invention also provides a feature information identification device, an electronic device, and a computer readable storage medium. The invention can improve the accuracy of identifying organization names present in voice information.

Description

Feature information identification method and device and computer readable storage medium
Technical Field
The present invention relates to the field of artificial intelligence, and in particular, to a method and apparatus for identifying feature information, an electronic device, and a computer readable storage medium.
Background
Voice recognition technology is widely used: it recognizes the information in speech and converts it into text. In many application scenarios the names of organizations mentioned in speech must be identified; for example, when a user handles business by telephone, the organization names in the voice information are recognized and converted into text.
In the prior art, identifying organization names requires collecting a large corpus for each scenario, which is time-consuming, and errors in the collected data make the identification of organization names inaccurate. If a large corpus is instead collected without first being sorted by scenario, the identification result is likewise not accurate enough.
Disclosure of Invention
The invention provides a feature information identification method and device, an electronic device, and a computer readable storage medium, whose main purpose is to improve the accuracy of identifying organization names present in voice information.
In order to achieve the above object, the present invention provides a method for identifying feature information, including:
receiving a text data set obtained through voice recognition, and extracting texts to be recognized of organization names from the text data set by using a named entity recognition technology;
summarizing the Chinese character combinations corresponding to the pinyin of the text to be identified to obtain a Chinese character combination candidate set, and carrying out vector calculation on at least two Chinese character combinations in the Chinese character combination candidate set to obtain average word vectors of the at least two Chinese character combinations;
obtaining a target industry category to which the organization name belongs, obtaining an industry category set containing the target industry category, and performing vector calculation on at least two industry categories containing the target industry category in the industry category set to obtain an average industry vector containing the at least two industry categories;
calculating first similarity of an average word vector of each Chinese character combination in the at least two Chinese character combinations and an average industry vector of the target industry class to obtain a first-level Chinese character combination candidate set, wherein the first-level Chinese character combination candidate set comprises the at least two Chinese character combinations and the first similarity respectively corresponding to the at least two Chinese character combinations;
performing cluster calculation on the industry category set to obtain an industry category cluster set, acquiring from the industry category cluster set the target cluster to which the target industry category belongs, and calculating an average cluster vector of the target cluster;
calculating a second similarity between the average word vector of each Chinese character combination in the at least two Chinese character combinations and the average cluster vector of the target cluster to obtain a second-level Chinese character combination candidate set, wherein the second-level Chinese character combination candidate set comprises the at least two Chinese character combinations and the second similarities respectively corresponding to the at least two Chinese character combinations;
performing weight calculation on the first similarity contained in the first-level Chinese character combination candidate set and the second similarity contained in the second-level Chinese character combination candidate set to obtain a Chinese character combination score result set;
and determining the Chinese character combination corresponding to the highest score in the Chinese character combination scoring result set as the Chinese character combination of the organization name.
Optionally, the vector calculation is performed on at least two Chinese character combinations in the Chinese character combination candidate set to obtain an average word vector of the at least two Chinese character combinations, including:
Acquiring the word vector of each Chinese character in at least two Chinese character combinations contained in the Chinese character combination candidate set by utilizing a pre-trained word vector dictionary;
and calculating the average value of the word vectors of all Chinese characters contained in each Chinese character combination in at least two Chinese character combinations according to the word vector of each Chinese character in the at least two Chinese character combinations to obtain the average word vector of the at least two Chinese character combinations.
Optionally, the obtaining the target industry category to which the organization name belongs, and obtaining the industry category set including the target industry category include:
forward and backward encoding is carried out on the text data set through a bidirectional LSTM network based on an attention mechanism, and vectors generated by the forward encoding and the backward encoding are spliced together to form a spliced vector;
inputting the spliced vector into a pre-constructed first industry classification model, and determining that the industry category output by the first industry classification model is the target industry category to which the organization name belongs;
classifying the Chinese character combination candidate set by using a pre-constructed second industry classification model to obtain a classification result, wherein the classification result comprises industry categories corresponding to Chinese character combinations contained in the Chinese character combination candidate set;
And combining different industry categories in the classification result with the target industry category to obtain an industry category set containing the target industry category.
Optionally, the obtaining the target industry category to which the organization name belongs, and obtaining the industry category set including the target industry category include:
acquiring a supplementary Chinese character combination candidate set, wherein the supplementary Chinese character combination candidate set comprises supplementary organization names;
classifying the organization name by using a pre-constructed third industry classification model to obtain the target industry category to which the organization name belongs;
classifying the supplementary Chinese character combination candidate set by using the third industry classification model to obtain a classification result, wherein the classification result comprises the industry categories corresponding to the supplementary organization names contained in the supplementary Chinese character combination candidate set;
and combining different industry categories in the classification result with the target industry category to obtain an industry category set containing the target industry category.
Optionally, the calculating the first similarity between the average word vector of each of the at least two Chinese character combinations and the average industry vector of the target industry category includes:
calculating a first similarity between the average word vector of each Chinese character combination of the at least two Chinese character combinations and the average industry vector of the target industry category through a similarity calculation function, wherein the similarity calculation function is as follows:
wherein $\mathrm{sim}(x_i, y_i)$ represents the first similarity, $x_i$ represents the average word vector of a Chinese character combination, $y_i$ represents the average industry vector of the target industry category, and $n$ represents the vector dimension of the average word vector or the average industry vector.
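The formula itself is not reproduced in this text. As an illustration only, the sketch below assumes cosine similarity, which is one common similarity function over two n-dimensional vectors; the function name `cosine_similarity` and the zero-vector convention are assumptions, not taken from the source.

```python
import math


def cosine_similarity(x, y):
    """Cosine similarity between two equal-length vectors (assumed choice of sim)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        # Assumed convention: a zero vector is treated as similar to nothing.
        return 0.0
    return dot / (norm_x * norm_y)


# Hypothetical 2-dimensional average word vector and average industry vector.
first_similarity = cosine_similarity([1.0, 2.0], [2.0, 4.0])
```

Any other similarity measure over the same two vectors could be substituted without changing the remaining steps.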
Optionally, the performing cluster calculation on the industry category set to obtain an industry category cluster set includes:
calculating the distance between every two industry categories in the industry category set, and merging the two industry categories with the smallest distance to obtain a cluster;
and repeatedly calculating the distance between every two of the industry categories not yet merged, and merging the two with the smallest distance to obtain a new cluster, until the number of clusters reaches a preset number, and determining that all the obtained clusters form the industry category cluster set.
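The merging loop above resembles bottom-up (agglomerative) clustering. The following is a minimal sketch under stated assumptions: each industry category is represented by its average industry vector, distance is Euclidean, and already merged clusters are compared through their centroids. None of these choices is specified in the source, and the category names and vectors are hypothetical.

```python
import math


def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(c) / n for c in zip(*vectors)]


def cluster_categories(category_vectors, target_count):
    """Merge the two closest clusters repeatedly until target_count clusters remain."""
    target_count = max(1, target_count)
    clusters = [[name] for name in category_vectors]
    while len(clusters) > target_count:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci = centroid([category_vectors[n] for n in clusters[i]])
                cj = centroid([category_vectors[n] for n in clusters[j]])
                d = math.dist(ci, cj)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical average industry vectors.
vectors = {
    "bank": [0.0, 0.0],
    "insurance": [0.1, 0.0],
    "beverage": [5.0, 5.0],
    "food": [5.0, 5.2],
}
clusters = cluster_categories(vectors, 2)
```

The preset number of clusters (`target_count` here) is the stopping condition named in the text.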
Optionally, the performing weight calculation on the first similarity contained in the first-level Chinese character combination candidate set and the second similarity contained in the second-level Chinese character combination candidate set to obtain a Chinese character combination score result set includes:
multiplying the first similarities contained in the first-level Chinese character combination candidate set and the second similarities contained in the second-level Chinese character combination candidate set by the same or different weights respectively to obtain a first weighted similarity set and a second weighted similarity set;
and adding each first weighted similarity in the first weighted similarity set to the corresponding second weighted similarity in the second weighted similarity set to obtain the Chinese character combination score result set.
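The weighting and selection steps can be sketched as a weighted sum per candidate followed by taking the maximum. The candidate labels, similarity values, and weights below are hypothetical; the source only states that the weights may be the same or different.

```python
def score_candidates(first_sims, second_sims, w1=0.5, w2=0.5):
    """Weighted sum of the first and second similarity for every candidate."""
    return {c: w1 * first_sims[c] + w2 * second_sims[c] for c in first_sims}


def best_candidate(scores):
    """Candidate with the highest weighted score."""
    return max(scores, key=scores.get)


# Hypothetical similarities for two candidate Chinese character combinations.
first_sims = {"candidate_a": 0.8, "candidate_b": 0.6}
second_sims = {"candidate_a": 0.5, "candidate_b": 0.9}

scores = score_candidates(first_sims, second_sims, w1=0.7, w2=0.3)
best = best_candidate(scores)
```

Giving the industry-level similarity the larger weight, as in this example, is only one possible setting.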
In order to solve the above-mentioned problem, the present invention also provides a device for identifying characteristic information, the device comprising:
the text recognition module is used for receiving a text data set obtained through voice recognition and extracting texts to be recognized of organization names from the text data set by using a named entity recognition technology;
the word vector calculation module is used for summarizing the Chinese character combinations corresponding to the pinyin of the text to be identified to obtain a Chinese character combination candidate set, and carrying out vector calculation on at least two Chinese character combinations in the Chinese character combination candidate set to obtain an average word vector of the at least two Chinese character combinations;
the industry vector calculation module is used for acquiring a target industry category to which the organization name belongs, acquiring an industry category set containing the target industry category, and carrying out vector calculation on at least two industry categories containing the target industry category in the industry category set to obtain an average industry vector containing the at least two industry categories;
The first similarity calculation module is used for calculating first similarity between an average word vector of each Chinese character combination in the at least two Chinese character combinations and an average industry vector of the target industry class to obtain a first-level Chinese character combination candidate set, wherein the first-level Chinese character combination candidate set comprises the at least two Chinese character combinations and first similarity respectively corresponding to the at least two Chinese character combinations;
the cluster vector calculation module is used for carrying out cluster calculation on the industry class set to obtain an industry class cluster set, acquiring a target cluster class to which the target industry class belongs from the industry class cluster set, and calculating an average cluster vector of the target cluster class;
a second similarity calculation module, configured to calculate a second similarity between the average word vector of each Chinese character combination in the at least two Chinese character combinations and the average cluster vector of the target cluster, to obtain a second-level Chinese character combination candidate set, where the second-level Chinese character combination candidate set includes the at least two Chinese character combinations and the second similarities respectively corresponding to the at least two Chinese character combinations;
the weight calculation module is used for carrying out weight calculation on the first similarity contained in the first-level Chinese character combination candidate set and the second similarity contained in the second-level Chinese character combination candidate set to obtain a Chinese character combination score result set;
And the determining module is used for determining the Chinese character combination corresponding to the highest score in the Chinese character combination score result set as the Chinese character combination of the organization name.
In order to solve the above-mentioned problems, the present invention also provides an electronic apparatus including:
a memory storing at least one instruction; and
a processor that executes the instructions stored in the memory to implement the above feature information identification method.
In order to solve the above-mentioned problems, the present invention also provides a computer-readable storage medium having stored therein at least one instruction that is executed by a processor in an electronic device to implement the above-mentioned identification method of feature information.
In the embodiment of the invention, the text to be recognized of an organization name is obtained, and the average word vector of each Chinese character combination corresponding to the pinyin of the text to be recognized is calculated; the target industry category to which the organization name belongs and the target cluster to which the target industry category belongs are obtained; a first similarity between the average word vector of each Chinese character combination and the average industry vector of the target industry category is calculated; a second similarity between the average word vector of each Chinese character combination and the average cluster vector of the target cluster is calculated; a weighted calculation is performed on the first similarities and the second similarities; and the Chinese character combination with the highest weighted score is determined as the Chinese character combination of the organization name. In this way the most accurate combination can be selected from the many possible Chinese character combinations of an organization name, which improves the accuracy of identifying organization names present in voice information.
Drawings
Fig. 1 is a flow chart of a method for identifying feature information according to an embodiment of the invention;
FIG. 2 is a schematic block diagram of an apparatus for identifying feature information according to an embodiment of the present invention;
fig. 3 is a schematic diagram of an internal structure of an electronic device for implementing a method for identifying feature information according to an embodiment of the present invention;
The achievement of the objects, functional features, and advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The invention provides a feature information identification method. Fig. 1 shows a flow chart of a method for identifying feature information according to an embodiment of the invention. The method may be performed by an apparatus, which may be implemented in software and/or hardware.
In this embodiment, the method for identifying feature information includes:
S1, receiving a text data set obtained through voice recognition, and extracting texts to be recognized of organization names from the text data set by using a named entity recognition technology.
In detail, the text data set is a text set obtained by converting a piece of audio into text through a voice recognition technology.
For example, the content of a press conference captured by a recording device is converted into text through voice recognition technology, and that text is the text data set.
In this embodiment, an entity (e.g., a person name, a place name, an organization name, a proper noun, etc.) having a specific meaning in the text is identified by a named entity recognition technique (Named Entity Recognition, simply called NER).
For example, when named entity recognition is applied to the text "Xiaoming is on vacation in Hawaii, and Xiaoqiang visited the Disney company", the recognition results include "Xiaoming-person name", "Hawaii-place name", and "Disney company-organization name".
S2, summarizing the Chinese character combinations corresponding to the pinyin of the text to be recognized to obtain a Chinese character combination candidate set, and carrying out vector calculation on at least two Chinese character combinations in the Chinese character combination candidate set to obtain average word vectors of the at least two Chinese character combinations.
In the embodiment of the invention, the pinyin of the text to be recognized is obtained from the text to be recognized. For example, if the text to be recognized is "Xiaoyi company", the corresponding pinyin is "xiǎoyǐ gōngsī".
In detail, the Chinese character combination candidate set is the set of different Chinese character combinations corresponding to the pinyin of the text to be recognized. Because of homophones, the pinyin of the text to be recognized may correspond to several different Chinese character combinations.
For example, if the text to be recognized is "Xiaoyi company" with the pinyin "xiǎoyǐ gōngsī", the Chinese character combinations corresponding to "xiǎoyǐ gōngsī" include not only the intended "Xiaoyi company" but also other combinations written with different homophonous characters; all Chinese character combinations corresponding to the pinyin "xiǎoyǐ gōngsī" are collected to obtain the Chinese character combination candidate set.
In this embodiment, all possible Chinese character combinations are summarized together to obtain a Chinese character combination candidate set, and then according to the subsequent steps, the most accurate combination is selected from the Chinese character combination candidate set, so that the accuracy of recognition can be improved.
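As an illustration of how such a candidate set might be enumerated, the sketch below takes the Cartesian product of the characters that share each pinyin syllable. The homophone table is a hypothetical toy dictionary keyed by toneless syllables; a real system would need a full pinyin-to-character lexicon and would likely prune the product.

```python
from itertools import product

# Hypothetical toy homophone table: toneless pinyin syllable -> candidate characters.
HOMOPHONES = {
    "xiao": ["小", "晓"],
    "yi": ["蚁", "乙", "以"],
    "gong": ["公"],
    "si": ["司"],
}


def candidate_combinations(syllables, table=HOMOPHONES):
    """Return every Chinese character combination matching the syllable sequence."""
    choices = [table[s] for s in syllables]
    return ["".join(chars) for chars in product(*choices)]


candidates = candidate_combinations(["xiao", "yi", "gong", "si"])
print(len(candidates))  # 2 * 3 * 1 * 1 = 6 combinations
```

The later steps then score each entry of `candidates` rather than trusting the single transcription produced by voice recognition.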
Further, in an optional embodiment of the present invention, the vector calculation is performed on at least two chinese character combinations in the chinese character combination candidate set to obtain an average word vector of the at least two chinese character combinations, including:
acquiring the word vector of each Chinese character in at least two Chinese character combinations contained in the Chinese character combination candidate set by utilizing a pre-trained word vector dictionary;
And calculating the average value of the word vectors of all Chinese characters contained in each Chinese character combination in at least two Chinese character combinations according to the word vector of each Chinese character in the at least two Chinese character combinations to obtain the average word vector of the at least two Chinese character combinations.
Preferably, the average value of the word vectors of all Chinese characters contained in the Chinese character combination can be calculated by adopting a calculation method of an arithmetic average value.
Further, the arithmetic mean is calculated as follows:
$$W = \frac{a_1 + a_2 + \cdots + a_n}{n}$$
where $a_1$ to $a_n$ denote the word vectors of the individual Chinese characters in a given Chinese character combination, $n$ denotes the number of Chinese characters in the combination, and $W$ denotes the average word vector of the combination.
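A minimal sketch of this averaging step follows. The 3-dimensional vectors in `CHAR_VECTORS` are hypothetical stand-ins for entries of a pre-trained word vector dictionary, and the function names are illustrative.

```python
def average_vector(vectors):
    """Element-wise arithmetic mean of equally sized vectors."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]


# Hypothetical 3-dimensional vectors standing in for a pre-trained dictionary.
CHAR_VECTORS = {
    "小": [1.0, 0.0, 2.0],
    "蚁": [3.0, 2.0, 0.0],
}


def average_word_vector(combination, char_vectors=CHAR_VECTORS):
    """Average the vectors of all characters in one Chinese character combination."""
    return average_vector([char_vectors[ch] for ch in combination])


W = average_word_vector("小蚁")
```

Here `W` plays the role of the average word vector defined by the arithmetic mean, with n = 2 characters.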
S3, acquiring a target industry category to which the organization name belongs, acquiring an industry category set containing the target industry category, and performing vector calculation on at least two industry categories containing the target industry category in the industry category set to obtain an average industry vector containing the at least two industry categories.
In this embodiment, the industry category set includes category names of a plurality of industry categories, for example, industry categories include category names of beverages, traditional Chinese medicines, banks, communications, and the like.
In an alternative embodiment, the industry category may be preset.
In an alternative embodiment, the target industry category to which the organization name belongs may be determined from the content of the text data set, for example from the context in which the organization name appears.
Preferably, in the embodiment of the present invention, a matrix WI may be constructed for each industry category of at least two industry categories including the target industry category, where the matrix WI includes n rows and m columns, each row is formed by an average word vector of a kanji combination of an organization name, n represents that each industry category includes n organizations, and m represents that the average word vector of the kanji combination of each organization name is m dimensions.
In the embodiment of the invention, a parameter matrix WO with m rows and k columns can be constructed, where m is the dimension of the average word vector of the Chinese character combination of each organization name and k is the number of industry categories; the initial values of the matrix are randomly generated by a method for generating truncated normal distribution random numbers (such as the function truncated_normal in a deep learning framework).
Further, the signature of the truncated_normal function is as follows:
tf.truncated_normal(shape, mean=0.0, stddev=1.0, dtype=tf.float32)
where shape is the shape of the generated matrix, mean is the mean of the truncated normal distribution, stddev is its standard deviation, and dtype is the data type of the generated values.
Further, the product of the matrix WI and the matrix WO is calculated, resulting in a new matrix WN (n x k).
Each row of the matrix WN (n x k) is defined as the new vector of the Chinese character combination of one organization name; the arithmetic mean of the new vectors of the Chinese character combinations of all organization names in the matrix is calculated to obtain the average industry vector of that industry category.
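The computation of WN and the average industry vector can be sketched in plain Python as follows. The helper names are illustrative, the matrix values are hypothetical, and the truncated normal sampler is a simple rejection-sampling stand-in (values more than two standard deviations from the mean are redrawn) rather than a call into any deep learning framework.

```python
import random


def truncated_normal(mean=0.0, stddev=1.0, rng=random):
    """Draw from a normal distribution, redrawing values beyond 2 stddev of the mean."""
    while True:
        x = rng.gauss(mean, stddev)
        if abs(x - mean) <= 2 * stddev:
            return x


def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x k) matrix, both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def industry_average_vector(wi, wo):
    """Project WI (n x m) through WO (m x k) and average the n resulting rows."""
    wn = matmul(wi, wo)
    n = len(wn)
    return [sum(col) / n for col in zip(*wn)]


rng = random.Random(0)
wi = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # n=3 organizations, m=2 dimensions
wo = [[truncated_normal(rng=rng) for _ in range(4)] for _ in range(2)]  # m=2, k=4
avg = industry_average_vector(wi, wo)  # k-dimensional average industry vector
```

The result has k components, one per industry category, matching the n x k shape of WN described above.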
Further, in another embodiment of the present invention, the obtaining the target industry category to which the organization name belongs, and obtaining the industry category set including the target industry category, includes:
forward and backward encoding is carried out on the text data set through a bidirectional LSTM network based on an attention mechanism, and vectors generated by the forward encoding and the backward encoding are spliced together to form a spliced vector;
inputting the spliced vector into a pre-constructed first industry classification model, and determining that the industry category output by the first industry classification model is the target industry category to which the organization name belongs;
classifying the Chinese character combination candidate set by using a pre-constructed second industry classification model to obtain a classification result, wherein the classification result comprises industry categories corresponding to Chinese character combinations contained in the Chinese character combination candidate set;
And combining different industry categories in the classification result with the target industry category to obtain an industry category set containing the target industry category.
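The final merging step amounts to a set union. In the sketch below the candidate labels and category names are hypothetical, and the classification result is assumed to be a mapping from each Chinese character combination to its predicted industry category.

```python
def build_industry_category_set(classification_result, target_category):
    """Union of the distinct categories in the classification result and the target."""
    categories = set(classification_result.values())
    categories.add(target_category)
    return categories


# Hypothetical output of the second industry classification model.
classification_result = {
    "candidate_a": "bank",
    "candidate_b": "beverage",
    "candidate_c": "bank",
}
category_set = build_industry_category_set(classification_result, "communications")
```

Using a set removes duplicate categories, so each industry category contributes one average industry vector in the later steps.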
In this embodiment, the target industry category to which the organization name belongs is determined according to the content of the text data set, so that the accuracy of the obtained target industry category can be improved.
In this embodiment, the bidirectional LSTM network based on the attention mechanism includes: input layer, embedded layer, LSTM layer, attention layer, output layer.
In this embodiment, the text data set is fed to the input layer to obtain an input text set; the embedding layer converts the input text set into word vectors; the LSTM layer processes the word vectors to obtain a forward encoding vector and a backward encoding vector at each state point; in the attention layer, an attention mechanism from deep learning is applied to the state points to obtain a weight for each state point, and the weights are multiplied by the forward encoding vectors and the backward encoding vectors respectively; the output layer then outputs the spliced vector formed from the two.
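The attention and splicing steps can be illustrated without a deep learning framework. The sketch below assumes that the forward and backward encoding vectors and a raw attention score per state point are already available, normalizes the scores with softmax, sums the weighted vectors over the state points, and concatenates the two results. The softmax normalization and the summation are assumptions; the source only states that weights are multiplied by the vectors and that the results are spliced.

```python
import math


def softmax(scores):
    """Normalize raw attention scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights, vectors):
    """Sum of the vectors, each scaled by its attention weight."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def attention_concat(forward_states, backward_states, scores):
    """Weight forward and backward encodings per state point, then splice them."""
    weights = softmax(scores)
    return weighted_sum(weights, forward_states) + weighted_sum(weights, backward_states)


# Hypothetical encodings for two state points with 2-dimensional hidden states.
forward_states = [[1.0, 0.0], [0.0, 1.0]]
backward_states = [[2.0, 2.0], [4.0, 4.0]]
spliced = attention_concat(forward_states, backward_states, scores=[0.0, 0.0])
```

In a trained network the scores would come from learned parameters; they are passed in directly here to keep the example self-contained.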
In this embodiment, the first industry classification model may be a multi-layer neural network model, where the multi-layer neural network includes an input layer, a hidden layer, and an output layer.
Further, the spliced vector is input into an input layer of the multi-layer neural network, a non-linear function is realized by utilizing a neuron activation function at a hidden layer of the multi-layer neural network, an industry category is output at an output layer of the multi-layer neural network, and the industry category is determined to be a target industry category to which the organization name belongs.
Further, in another embodiment of the present invention, the obtaining the target industry category to which the organization name belongs, and obtaining the industry category set including the target industry category, includes:
acquiring a supplementary Chinese character combination candidate set, wherein the supplementary Chinese character combination candidate set comprises supplementary organization names;
classifying the organization name by using a pre-constructed third industry classification model to obtain the target industry category to which the organization name belongs;
classifying the supplementary Chinese character combination candidate set by using the third industry classification model to obtain a classification result, wherein the classification result comprises the industry categories corresponding to the supplementary organization names contained in the supplementary Chinese character combination candidate set;
and combining different industry categories in the classification result with the target industry category to obtain an industry category set containing the target industry category.
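The four steps above can be sketched as follows. The classifier is passed in as a function; `toy_classify` is a hypothetical keyword rule standing in for the trained third industry classification model, and the organization names are invented for illustration.

```python
def build_industry_category_set(org_name, supplementary_names, classify):
    """Return the target category and the de-duplicated industry category set."""
    target = classify(org_name)
    categories = [target]  # the set always contains the target category
    for name in supplementary_names:
        category = classify(name)
        if category not in categories:  # keep only distinct categories
            categories.append(category)
    return target, categories


def toy_classify(name):
    """Hypothetical stand-in for the third industry classification model."""
    if "Bank" in name:
        return "banking"
    if "Tea" in name:
        return "beverages"
    return "other"


target, category_set = build_industry_category_set(
    "Harbor Bank", ["Green Tea Co", "City Bank", "Acme Ltd"], toy_classify
)
```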
In this embodiment, a supplementary organization name may be the name text of an enterprise or institution that has been registered or filed, for example the names of the companies already registered in a certain city.
In this embodiment, the set of company names currently registered in a city can be obtained from a government open data platform, and this set is the supplementary Chinese character combination candidate set.
In this embodiment, the third industry classification model and the second industry classification model may be the same or different models.
Further, in the embodiment of the present invention, a third industry classification model constructed with a neural network model is trained on the categories in the national economic industry classification table; the trained third industry classification model then classifies the organization name and the supplementary organization names contained in the supplementary Chinese character combination candidate set.
In the embodiment of the present invention, the third industry classification model may be constructed with a machine learning method (such as decision trees or random forests) or a deep learning method (such as convolutional neural networks or recurrent neural networks).
For example, the supplementary Chinese character combination candidate set may be used as the training set and a set of organization name texts labeled with industry categories as the label set, and a third industry classification model based on a convolutional neural network comprising convolutional layers, pooling layers, and a fully connected layer is constructed. The training set is input to the first convolutional layer, which performs a depthwise separable convolution to obtain a convolved data set that is input to the first pooling layer;
the first pooling layer performs max pooling on the convolved data set to obtain a dimension-reduced data set, which is input to the second convolutional layer; the second convolutional layer performs the depthwise separable convolution and passes its output to the second pooling layer for max pooling, and so on until the dimension-reduced data set is finally input to the fully connected layer;
the fully connected layer, combined with an activation function, computes a training value; the training value is input to a pre-constructed loss function, which computes a loss value from the label set and the training value; a gradient descent algorithm is used to minimize the loss function, and training of the third industry classification model ends when the loss value reaches its minimum.
S4, calculating first similarity of an average word vector of each Chinese character combination in the at least two Chinese character combinations and an average industry vector of the target industry class to obtain a first-level Chinese character combination candidate set, wherein the first-level Chinese character combination candidate set comprises the at least two Chinese character combinations and the first similarity respectively corresponding to the at least two Chinese character combinations.
In detail, the similarity calculation method has various forms, such as a euclidean distance method, a cosine distance method, and the like.
Preferably, the calculating the first similarity between the average word vector of each of the at least two Chinese character combinations and the average industry vector of the target industry category includes:
calculating a first similarity between an average word vector of each Chinese character combination of the at least two Chinese character combinations and an average industry vector of the target industry category through a similarity calculation function, wherein the similarity calculation function is as follows:
where $sim(x_i, y_i)$ denotes the first similarity, $x_i$ denotes the average word vector of the Chinese character combination, $y_i$ denotes the average industry vector of the target industry category, and $n$ denotes the vector dimension of the average word vector or the average industry vector.
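A minimal sketch of the first similarity, assuming the similarity calculation function is cosine similarity (one of the methods named above; that choice, and the function names, are assumptions of this sketch). It returns a first-level candidate set that maps each Chinese character combination to its first similarity.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two vectors of the same dimension n."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_x * norm_y)


def first_level_candidates(avg_word_vectors, avg_industry_vector):
    """Map each Chinese character combination to its first similarity."""
    return {
        combo: cosine_similarity(vec, avg_industry_vector)
        for combo, vec in avg_word_vectors.items()
    }
```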
S5, carrying out cluster calculation on the industry class set to obtain an industry class cluster set, acquiring a target cluster class to which the target industry class belongs from the industry class cluster set, and calculating an average cluster class vector of the target cluster class.
In this embodiment, a target cluster class to which a target industry class belongs is obtained from an industry class cluster set, and an arithmetic average value of all industry average vectors contained in the target cluster class is calculated, so that an average cluster class vector of the target cluster class can be obtained.
In detail, in an optional embodiment of the present invention, the performing cluster computation on the industry category set to obtain an industry category cluster set includes:
calculating the distance between any two industry categories in the industry category set, and combining the two industry categories with the minimum distance to obtain a cluster;
and circularly calculating the distance between any two industry categories in the non-combined industry categories, combining the two industry categories with the smallest distance to obtain new cluster categories until the number of the cluster categories reaches the preset number, and determining that all the obtained cluster categories form an industry category cluster set.
In the embodiment of the invention, each industry is initially regarded as an independent cluster by using a hierarchical clustering algorithm, and then two clusters closest to each other are combined, and the process is repeated until the number of the clusters reaches the preset number.
In this embodiment, the distance between two clusters is the distance between the average industry vectors of two industries that are closest to each other and are included in the two clusters, and the formula is as follows:
$$DIST(C1, C2) = \min_{P_i \in C1,\; P_j \in C2} dist(P_i, P_j)$$
where C1 and C2 denote two clusters, $DIST(C1, C2)$ denotes the distance between the two clusters C1 and C2, $P_i$ denotes the average industry vector of an industry contained in cluster C1, $P_j$ denotes the average industry vector of an industry contained in cluster C2, and $dist(P_i, P_j)$ denotes the distance between $P_i$ and $P_j$, which may be calculated with the Euclidean distance formula.
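The clustering of step S5 can be sketched as a small agglomerative procedure that uses the closest-pair (single-linkage) distance described above, followed by the arithmetic mean that gives the average cluster vector. The function names and the toy vectors are assumptions for illustration.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def cluster_distance(c1, c2, vectors):
    """Distance between the two closest industries of the two clusters."""
    return min(euclidean(vectors[i], vectors[j]) for i in c1 for j in c2)


def cluster_industries(vectors, target_count):
    """Merge the two closest clusters until target_count clusters remain."""
    if target_count < 1:
        raise ValueError("target_count must be at least 1")
    clusters = [[name] for name in vectors]  # each industry starts alone
    while len(clusters) > target_count:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b], vectors)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def average_cluster_vector(cluster, vectors):
    """Arithmetic mean of the average industry vectors in one cluster."""
    dim = len(vectors[cluster[0]])
    return [sum(vectors[name][d] for name in cluster) / len(cluster)
            for d in range(dim)]
```

The target cluster is then the cluster whose member list contains the target industry category.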
S6, calculating the second similarity of the average word vector of each Chinese character combination in the at least two Chinese character combinations and the average cluster vector of the target cluster to obtain a second-level Chinese character combination candidate set, wherein the second-level Chinese character combination candidate set comprises at least two Chinese character combinations and the second similarity respectively corresponding to the at least two Chinese character combinations.
In detail, there are various methods for calculating the similarity, such as euclidean distance method, cosine distance method, and the like. Preferably, the invention can adopt a multidimensional cosine similarity calculation method, wherein the multidimensional cosine similarity calculation method is as follows:
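$$sim(x_i, z_i) = \frac{\sum_{i=1}^{n} x_i z_i}{\sqrt{\sum_{i=1}^{n} x_i^{2}}\,\sqrt{\sum_{i=1}^{n} z_i^{2}}}$$
where $sim(x_i, z_i)$ denotes the second similarity, $x_i$ denotes the i-th component of the average word vector of the Chinese character combination, $z_i$ denotes the i-th component of the average cluster vector of the target cluster, and $n$ denotes the vector dimension of the average word vector of the Chinese character combination or the average cluster vector of the target cluster.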
And S7, carrying out weight calculation on the first similarity contained in the first-level Chinese character combination candidate set and the second similarity contained in the second-level Chinese character combination candidate set to obtain a Chinese character combination score result set.
In this embodiment, performing weight calculation on the first similarity contained in the first-level Chinese character combination candidate set and the second similarity contained in the second-level Chinese character combination candidate set means that, for each of the at least two Chinese character combinations, the first similarity of that combination in the first-level Chinese character combination candidate set and the second similarity of the same combination in the second-level Chinese character combination candidate set are weighted together.
For example, the first similarity of one Chinese character combination in the first-level Chinese character combination candidate set is weighted together with the second similarity of that same combination in the second-level Chinese character combination candidate set, and the same is done for every other Chinese character combination.
In detail, the performing weight calculation on the first similarity contained in the first-level Chinese character combination candidate set and the second similarity contained in the second-level Chinese character combination candidate set to obtain a Chinese character combination score result set includes:
multiplying a plurality of first similarities contained in the first-level Chinese character combination candidate set and a plurality of second similarities contained in the second-level Chinese character combination candidate set by the same or different weights respectively to obtain a first weight similarity set and a second weight similarity set;
And correspondingly adding a plurality of first weight similarities contained in the first weight similarity set and a plurality of second weight similarities contained in the second weight similarity set respectively to obtain a Chinese character combination scoring result set.
In this embodiment, the weights may be preset, and the sum of all the weights is 1.
In this embodiment, when the first weight similarities in the first weight similarity set and the second weight similarities in the second weight similarity set are added correspondingly, the first weight similarity and the second weight similarity that belong to the same Chinese character combination are added together, so as to obtain a score for each Chinese character combination.
For example, suppose that in the first-level Chinese character combination candidate set the first similarity of Chinese character combination A is p and that of Chinese character combination B is m, and that in the second-level Chinese character combination candidate set the second similarity of combination A is q and that of combination B is n. The first similarities are multiplied by a weight a and the second similarities by a weight b (b = 1 - a), so that the first weight similarity set contains ap and am, the second weight similarity set contains bq and bn, and the Chinese character combination score result set contains ap + bq and am + bn.
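The weighting of step S7 can be sketched as below, with the weight a applied to the first similarity and b = 1 - a applied to the second. The function name and the sample similarity values are assumptions for illustration.

```python
def score_combinations(first_sims, second_sims, weight_first):
    """Return a * first similarity + (1 - a) * second similarity per combination."""
    if not 0.0 <= weight_first <= 1.0:
        raise ValueError("weight_first must lie in [0, 1]")
    weight_second = 1.0 - weight_first  # the weights sum to 1
    return {
        combo: weight_first * first_sims[combo] + weight_second * second_sims[combo]
        for combo in first_sims
    }


# Combination A has similarities p = 0.8, q = 0.4; combination B has m = 0.6, n = 0.9.
scores = score_combinations({"A": 0.8, "B": 0.6}, {"A": 0.4, "B": 0.9}, 0.5)
```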
S8, determining the Chinese character combination corresponding to the highest score in the Chinese character combination score result set as the Chinese character combination of the organization name.
In this embodiment, after the Chinese character combination with the highest score in the Chinese character combination score result set has been determined to be the Chinese character combination of the organization name, it may further be determined whether the text to be recognized extracted in step S1 is identical to that Chinese character combination; if so, the text to be recognized of the organization name is output unchanged, and if not, the text to be recognized is replaced with that Chinese character combination and then output.
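Step S8 and the replacement check can be sketched as follows. The function name, the returned flag, and the sample strings are assumptions for illustration.

```python
def resolve_organization_name(recognized_text, scores):
    """Pick the highest-scoring combination and report whether it replaces the input."""
    best_combo = max(scores, key=scores.get)
    was_replaced = recognized_text != best_combo
    return best_combo, was_replaced


name, replaced = resolve_organization_name("A", {"A": 0.6, "B": 0.75})
```

Here the recognized text "A" differs from the best-scoring combination "B", so the output is "B" and the flag marks that a replacement took place.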
In the embodiment of the invention, a text to be identified of an organization name is obtained, and an average word vector of Chinese character combinations corresponding to pinyin of the text to be identified is calculated; obtaining a target industry class to which an organization name belongs and a target cluster class to which the target industry class belongs; calculating the first similarity between the average character vector of each Chinese character combination in the Chinese character combinations and the average industry vector of the target industry class; and calculating the second similarity of the average word vector of each Chinese character combination in the Chinese character combinations and the average cluster vector of the target cluster, carrying out weight calculation on the first similarities and the second similarities, and determining the Chinese character combination with the highest weight calculation score as the Chinese character combination of the organization name. The most accurate word combination can be selected from a plurality of possible word combinations of the organization, thereby achieving the purpose of improving the accuracy of identifying the organization names existing in the voice information.
Fig. 2 is a functional block diagram of the feature information identification device of the present invention.
The identification device 100 for characteristic information according to the present invention may be installed in an electronic apparatus. The recognition device of the feature information may include a text recognition module 101, a word vector calculation module 102, an industry vector calculation module 103, a first similarity calculation module 104, a cluster vector calculation module 105, a second similarity calculation module 106, a weight calculation module 107, and a determination module 108 according to the implemented functions. The module of the invention, which may also be referred to as a unit, refers to a series of computer program segments, which are stored in the memory of the electronic device, capable of being executed by the processor of the electronic device and of performing a fixed function.
In the present embodiment, the functions concerning the respective modules/units are as follows:
the text recognition module 101 is configured to receive a text data set obtained by speech recognition, and extract a text to be recognized of an organization name from the text data set using a named entity recognition technique.
In detail, the text data set is a text set obtained by converting a piece of audio into text through a voice recognition technology.
For example, the content of a press conference captured by a recording device at the conference is converted into text through speech recognition technology, and that text is the text data set.
In this embodiment, an entity (e.g., a person name, a place name, an organization name, a proper noun, etc.) having a specific meaning in the text is identified by a named entity recognition technique (Named Entity Recognition, simply called NER).
For example, applying named entity recognition to the text "Xiaoming is on vacation in Hawaii, and Xiaoqiang visited the Disney company" yields the recognition results "Xiaoming - person name", "Hawaii - place name", and "Disney company - organization name".
The word vector calculation module 102 is configured to collect the Chinese character combinations corresponding to the pinyin of the text to be recognized to obtain a Chinese character combination candidate set, and to perform vector calculation on at least two Chinese character combinations in the Chinese character combination candidate set to obtain an average word vector of the at least two Chinese character combinations.
In the embodiment of the present invention, the pinyin of the text to be recognized is obtained from the text to be recognized. For example, if the text to be recognized is "Xiaoyi Company", the corresponding pinyin is "xiǎoyǐ gōngsī".
In detail, the Chinese character combination candidate set is the set of different Chinese character combinations that correspond to the pinyin of the text to be recognized. Because of homophones, the pinyin of the text to be recognized may correspond to different Chinese character combinations.
For example, if the text to be recognized is "Xiaoyi Company", the corresponding pinyin is "xiǎoyǐ gōngsī"; because of homophones, this pinyin corresponds not only to "Xiaoyi Company" but also to other Chinese character combinations that are pronounced the same and written with different characters. All Chinese character combinations corresponding to the pinyin "xiǎoyǐ gōngsī" are collected to obtain the Chinese character combination candidate set.
In this embodiment, all possible Chinese character combinations are collected into the Chinese character combination candidate set, and the other modules then select the most accurate combination from that candidate set, which can improve the accuracy of recognition.
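Enumerating a candidate set from pinyin can be sketched as follows. `HOMOPHONES` is a hypothetical toy table from toneless pinyin syllables to characters; a real system would need a full pinyin dictionary, and the syllable segmentation is assumed to be given.

```python
from itertools import product

# Hypothetical toy table: toneless pinyin syllable -> characters with that reading.
HOMOPHONES = {
    "xiao": ["小", "晓"],
    "yi": ["蚁", "乙"],
    "gong": ["公"],
    "si": ["司"],
}


def candidate_combinations(syllables, table):
    """Return every character string whose syllables match the pinyin."""
    options = [table[s] for s in syllables]
    # The Cartesian product picks one character per syllable in every possible way.
    return ["".join(chars) for chars in product(*options)]


candidates = candidate_combinations(["xiao", "yi", "gong", "si"], HOMOPHONES)
```

The size of the candidate set is the product of the number of homophones per syllable, so a practical implementation would likely prune the candidates, for example against a list of known organization names.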
Further, in an optional embodiment of the present invention, the word vector calculation module performs vector calculation on at least two Chinese character combinations in the Chinese character combination candidate set to obtain an average word vector of the at least two Chinese character combinations, including:
acquiring the word vector of each Chinese character in at least two Chinese character combinations contained in the Chinese character combination candidate set by utilizing a pre-trained word vector dictionary;
and calculating the average value of the word vectors of all Chinese characters contained in each Chinese character combination in at least two Chinese character combinations according to the word vector of each Chinese character in the at least two Chinese character combinations to obtain the average word vector of the at least two Chinese character combinations.
Preferably, the average value of the word vectors of all Chinese characters contained in the Chinese character combination can be calculated by adopting a calculation method of an arithmetic average value.
Further, the arithmetic mean is calculated as follows:
$$W = \frac{a_1 + a_2 + \cdots + a_n}{n}$$
where $a_1$ to $a_n$ denote the character vectors of the individual Chinese characters in a given Chinese character combination, $n$ denotes the number of Chinese characters in the Chinese character combination, and $W$ denotes the average word vector of the Chinese character combination.
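The arithmetic mean above can be sketched as a component-wise average over the character vectors of one combination. The dictionary `char_vectors` stands in for the pre-trained word vector dictionary, and its two-dimensional values are invented for illustration.

```python
def average_char_vector(combination, char_vectors):
    """Component-wise mean of the character vectors of one combination."""
    vectors = [char_vectors[ch] for ch in combination]
    n = len(vectors)  # number of characters in the combination
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / n for d in range(dim)]


# Hypothetical two-dimensional character vectors.
char_vectors = {"公": [1.0, 2.0], "司": [3.0, 4.0]}
w = average_char_vector("公司", char_vectors)
```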
The industry vector calculation module 103 is configured to obtain a target industry category to which the organization name belongs, obtain an industry category set including the target industry category, and perform vector calculation on at least two industry categories including the target industry category in the industry category set to obtain an average industry vector including the at least two industry categories.
In this embodiment, the industry category set includes category names of a plurality of industry categories, for example, industry categories include category names of beverages, traditional Chinese medicines, banks, communications, and the like.
In an alternative embodiment, the industry category may be preset.
In an alternative embodiment, the target industry category to which the organization name belongs may be determined from the content of the text data set, for example from the context in which the organization name appears in the text data set.
Preferably, in the embodiment of the present invention, a matrix WI may be constructed for each of the at least two industry categories including the target industry category. The matrix WI has n rows and m columns; each row is the average word vector of the Chinese character combination of one organization name, n is the number of organizations in the industry category, and m is the dimension of the average word vector of the Chinese character combination of each organization name.
In the embodiment of the present invention, a parameter matrix WO with m rows and k columns may also be constructed, where m is the dimension of the average word vector of the Chinese character combination of each organization name and k is the number of industry categories. The initial values of WO are generated randomly from a truncated normal distribution (for example, with the truncated_normal function of a deep learning framework).
Further, the truncated_normal function formula is as follows:
tf.truncated_normal(shape,mean=0.0,stddev=1.0,dtype=tf.float32)
wherein shape represents the dimension of the generated matrix, mean is the matrix parameter mean, stddev is the standard deviation of the matrix parameters, and dtype represents the type of matrix parameters.
Further, the product of the matrix WI and the matrix WO is calculated, resulting in a new matrix WN (n x k).
Each row of the matrix WN (n x k) is defined as the new vector of the Chinese character combination of one organization name; the arithmetic mean of the new vectors of all organization names in the matrix is then calculated to obtain the industry average vector of that industry category.
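The computation WN = WI x WO followed by a row-wise mean can be sketched in plain Python as below. The helper `truncated_normal` imitates the re-draw behaviour of a truncated normal initializer (samples more than two standard deviations from the mean are drawn again); it is a stand-in written for this sketch, not the framework function, and all names are assumptions.

```python
import random


def truncated_normal(rows, cols, mean=0.0, stddev=1.0, rng=random):
    """Random rows x cols matrix; samples beyond 2 stddev are re-drawn."""
    def draw():
        while True:
            value = rng.gauss(mean, stddev)
            if abs(value - mean) <= 2 * stddev:
                return value
    return [[draw() for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    """Plain matrix product of an n x m and an m x k matrix."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def industry_average_vector(wi, wo):
    """Mean over the rows of WN = WI x WO, giving one k-dimensional vector."""
    wn = matmul(wi, wo)  # each row is the new vector of one organization name
    n = len(wn)
    return [sum(row[j] for row in wn) / n for j in range(len(wn[0]))]
```

A typical call would build WO once with `truncated_normal(m, k)` and reuse it for the matrix WI of every industry category, so that all industry average vectors lie in the same k-dimensional space.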
Further, in another embodiment of the present invention, the industry vector calculation module obtains a target industry category to which the organization name belongs, and obtains an industry category set including the target industry category, including:
forward and backward encoding is carried out on the text data set through a bidirectional LSTM network based on an attention mechanism, and vectors generated by the forward encoding and the backward encoding are spliced together to form a spliced vector;
inputting the spliced vector into a pre-constructed first industry classification model, and determining that the industry category output by the first industry classification model is the target industry category to which the organization name belongs;
classifying the Chinese character combination candidate set by using a pre-constructed second industry classification model to obtain a classification result, wherein the classification result comprises industry categories corresponding to Chinese character combinations contained in the Chinese character combination candidate set;
and combining different industry categories in the classification result with the target industry category to obtain an industry category set containing the target industry category.
In this embodiment, the target industry category to which the organization name belongs is determined from the content of the text data set, which can improve the accuracy of the obtained target industry category.
In this embodiment, the bidirectional LSTM network based on the attention mechanism includes: input layer, embedded layer, LSTM layer, attention layer, output layer.
In this embodiment, the text data set is fed to the input layer to obtain an input text set, the embedding layer converts the input text set into word vectors, and the LSTM layer processes the word vectors to obtain a forward encoding vector and a backward encoding vector at each state point; in the attention layer, an attention mechanism from deep learning is applied to the state points to obtain a weight for each state point, each weight is multiplied by the corresponding forward encoding vector and backward encoding vector, and the output layer outputs the concatenation (the spliced vector) of the two weighted vectors.
In this embodiment, the first industry classification model may be a multi-layer neural network model, where the multi-layer neural network includes an input layer, a hidden layer, and an output layer.
Further, the spliced vector is input to the input layer of the multi-layer neural network, a neuron activation function in the hidden layer provides the non-linear mapping, the output layer outputs an industry category, and that industry category is determined to be the target industry category to which the organization name belongs.
Further, in another embodiment of the present invention, the industry vector calculation module obtains a target industry category to which the organization name belongs, and obtains an industry category set including the target industry category, including:
acquiring a supplementary Chinese character combination candidate set, wherein the supplementary Chinese character combination candidate set comprises a supplementary organization name;
classifying the organization names by using a pre-constructed third industry classification model to obtain target industry categories to which the organization names belong;
classifying the supplementary Chinese character combination candidate set by using the third industry classification model to obtain a classification result, wherein the classification result comprises the industry categories corresponding to the supplementary organization names contained in the supplementary Chinese character combination candidate set;
and combining different industry categories in the classification result with the target industry category to obtain an industry category set containing the target industry category.
In this embodiment, a supplementary organization name may be the name text of an enterprise or institution that has been registered or filed, for example the names of the companies already registered in a certain city.
In this embodiment, the set of company names currently registered in a city can be obtained from a government open data platform, and this set is the supplementary Chinese character combination candidate set.
In this embodiment, the third industry classification model and the second industry classification model may be the same or different models.
Further, in the embodiment of the present invention, a third industry classification model constructed with a neural network model is trained on the categories in the national economic industry classification table; the trained third industry classification model then classifies the organization name and the supplementary organization names contained in the supplementary Chinese character combination candidate set.
In the embodiment of the present invention, the third industry classification model may be constructed with a machine learning method (such as decision trees or random forests) or a deep learning method (such as convolutional neural networks or recurrent neural networks).
For example, the supplementary Chinese character combination candidate set may be used as the training set and a set of organization name texts labeled with industry categories as the label set, and a third industry classification model based on a convolutional neural network comprising convolutional layers, pooling layers, and a fully connected layer is constructed. The training set is input to the first convolutional layer, which performs a depthwise separable convolution to obtain a convolved data set that is input to the first pooling layer;
the first pooling layer performs max pooling on the convolved data set to obtain a dimension-reduced data set, which is input to the second convolutional layer; the second convolutional layer performs the depthwise separable convolution and passes its output to the second pooling layer for max pooling, and so on until the dimension-reduced data set is finally input to the fully connected layer;
the fully connected layer, combined with an activation function, computes a training value; the training value is input to a pre-constructed loss function, which computes a loss value from the label set and the training value; a gradient descent algorithm is used to minimize the loss function, and training of the third industry classification model ends when the loss value reaches its minimum.
The first similarity calculation module 104 is configured to calculate a first similarity between the average word vector of each of the at least two Chinese character combinations and the average industry vector of the target industry category, to obtain a first-level Chinese character combination candidate set, where the first-level Chinese character combination candidate set includes the at least two Chinese character combinations and the first similarities respectively corresponding to the at least two Chinese character combinations.
In detail, the similarity calculation method has various forms, such as a euclidean distance method, a cosine distance method, and the like.
Preferably, the first similarity calculation module calculates the first similarity between the average word vector of each of the at least two Chinese character combinations and the average industry vector of the target industry category, including:
calculating a first similarity between an average word vector of each Chinese character combination of the at least two Chinese character combinations and an average industry vector of the target industry category through a similarity calculation function, wherein the similarity calculation function is as follows:
where $sim(x_i, y_i)$ denotes the first similarity, $x_i$ denotes the average word vector of the Chinese character combination, $y_i$ denotes the average industry vector of the target industry category, and $n$ denotes the vector dimension of the average word vector or the average industry vector.
The cluster vector calculation module 105 is configured to perform cluster calculation on the industry class set to obtain an industry class cluster set, obtain a target cluster class to which the target industry class belongs from the industry class cluster set, and calculate an average cluster vector of the target cluster class.
In this embodiment, a target cluster class to which a target industry class belongs is obtained from an industry class cluster set, and an arithmetic average value of all industry average vectors contained in the target cluster class is calculated, so that an average cluster class vector of the target cluster class can be obtained.
In detail, in an optional embodiment of the present invention, the cluster class vector calculation module performs cluster calculation on the industry class set to obtain an industry class cluster set, and includes:
calculating the distance between any two industry categories in the industry category set, and combining the two industry categories with the minimum distance to obtain a cluster;
and circularly calculating the distance between any two industry categories in the non-combined industry categories, combining the two industry categories with the smallest distance to obtain new cluster categories until the number of the cluster categories reaches the preset number, and determining that all the obtained cluster categories form an industry category cluster set.
In the embodiment of the invention, each industry is initially regarded as an independent cluster by using a hierarchical clustering algorithm, and then two clusters closest to each other are combined, and the process is repeated until the number of the clusters reaches the preset number.
In this embodiment, the distance between two clusters is the distance between the average industry vectors of two industries that are closest to each other and are included in the two clusters, and the formula is as follows:
$$\mathrm{DIST}(C1, C2) = \min_{P_i \in C1,\; P_j \in C2} \mathrm{dist}(P_i, P_j)$$
wherein C1 and C2 represent two clusters, DIST(C1, C2) represents the distance between the two clusters C1 and C2, P_i represents the average industry vector of an industry contained in cluster C1, P_j represents the average industry vector of an industry contained in cluster C2, and dist(P_i, P_j) represents the distance between P_i and P_j, which may be calculated using the Euclidean distance formula.
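The clustering procedure described above can be sketched as a small agglomerative loop. This is only an illustration under stated assumptions: it starts with every industry as its own cluster, uses the closest-pair (single-linkage) distance with Euclidean `dist`, and stops at a preset number of clusters. The function names and the toy industries and vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def euclidean(p: Sequence[float], q: Sequence[float]) -> float:
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def cluster_distance(c1: Sequence[str], c2: Sequence[str],
                     vectors: Dict[str, Sequence[float]]) -> float:
    """Distance between the closest pair of industries, one from each cluster."""
    return min(euclidean(vectors[i], vectors[j]) for i in c1 for j in c2)


def cluster_industries(vectors: Dict[str, Sequence[float]],
                       target_count: int) -> List[List[str]]:
    """Merge the two closest clusters until target_count clusters remain."""
    if target_count < 1:
        raise ValueError("target_count must be at least 1")
    clusters: List[List[str]] = [[name] for name in vectors]
    while len(clusters) > target_count:
        best = None  # (distance, index_a, index_b) of the closest pair so far
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b], vectors)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return clusters


# Toy data (hypothetical): 2-dimensional average industry vectors.
industry_vectors = {
    "banking": [0.0, 0.0],
    "insurance": [0.0, 1.0],
    "retail": [5.0, 5.0],
    "catering": [5.0, 6.0],
}
industry_clusters = cluster_industries(industry_vectors, target_count=2)
```

The exhaustive pair search is quadratic per merge, which is acceptable for a small industry class set; a larger set would call for a library implementation of hierarchical clustering.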
The second similarity calculation module 106 is configured to calculate a second similarity between the average word vector of each of the at least two Chinese character combinations and the average cluster vector of the target cluster, to obtain a second-level Chinese character combination candidate set, where the second-level Chinese character combination candidate set includes the at least two Chinese character combinations and the second similarities respectively corresponding to the at least two Chinese character combinations.
In detail, there are various methods for calculating the similarity, such as euclidean distance method, cosine distance method, and the like. Preferably, the invention can adopt a multidimensional cosine similarity calculation method, wherein the multidimensional cosine similarity calculation method is as follows:
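$$\mathrm{sim}(x_i, z_i) = \frac{\sum_{i=1}^{n} x_i\, z_i}{\sqrt{\sum_{i=1}^{n} x_i^{2}}\;\sqrt{\sum_{i=1}^{n} z_i^{2}}}$$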
wherein sim(x_i, z_i) represents the second similarity, x_i represents the i-th component of the average word vector of a Chinese character combination, z_i represents the i-th component of the average cluster vector of the target cluster, and n represents the vector dimension of the average word vector of the Chinese character combination or of the average cluster vector of the target cluster.
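A minimal sketch of the multidimensional cosine similarity and of building the second-level candidate set from it is given below. The function names and the toy vectors are hypothetical, and returning 0.0 for a zero vector is an assumption of this sketch, since the text does not say how that case is handled.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(x: Sequence[float], z: Sequence[float]) -> float:
    """Cosine of the angle between two n-dimensional vectors."""
    if len(x) != len(z):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(x, z))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_z = math.sqrt(sum(b * b for b in z))
    if norm_x == 0.0 or norm_z == 0.0:
        return 0.0  # assumption: a zero vector is treated as having no similarity
    return dot / (norm_x * norm_z)


def second_level_candidates(
    word_vectors: Dict[str, Sequence[float]],
    cluster_vector: Sequence[float],
) -> Dict[str, float]:
    """Map each Chinese character combination to its second similarity."""
    return {
        combo: cosine_similarity(vec, cluster_vector)
        for combo, vec in word_vectors.items()
    }


# Toy data (hypothetical): average word vectors of two candidate combinations.
average_word_vectors = {"A": [3.0, 4.0], "B": [4.0, -3.0]}
average_cluster_vec = [3.0, 4.0]
second_level = second_level_candidates(average_word_vectors, average_cluster_vec)
```

Combination A points in the same direction as the cluster vector and combination B is orthogonal to it, so A receives the higher second similarity.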
The weight calculation module 107 is configured to perform weight calculation on the first similarity included in the first-level Chinese character combination candidate set and the second similarity included in the second-level Chinese character combination candidate set, so as to obtain a Chinese character combination score result set.
In this embodiment, the weight calculation is performed per Chinese character combination: for each of the at least two Chinese character combinations, its first similarity in the first-level Chinese character combination candidate set is weighted together with its corresponding second similarity in the second-level Chinese character combination candidate set.
For example, the first similarity of one Chinese character combination in the first-level Chinese character combination candidate set is weighted together with the second similarity of that same combination in the second-level Chinese character combination candidate set, and the same calculation is performed for each of the other Chinese character combinations.
In detail, the weight calculation module 107 is specifically configured to:
multiplying a plurality of first similarities contained in the first-level Chinese character combination candidate set and a plurality of second similarities contained in the second-level Chinese character combination candidate set by the same or different weights respectively to obtain a first weight similarity set and a second weight similarity set;
and adding each first weight similarity contained in the first weight similarity set to the corresponding second weight similarity contained in the second weight similarity set to obtain the Chinese character combination score result set.
In this embodiment, the weights may be preset, and the sum of all the weights is 1.
In this embodiment, the corresponding addition pairs entries by Chinese character combination: the first weight similarity and the second weight similarity that belong to the same Chinese character combination are added together, which yields one scoring result for each Chinese character combination.
For example, suppose that in the first-level Chinese character combination candidate set the first similarity of Chinese character combination A is p and that of Chinese character combination B is m, and that in the second-level Chinese character combination candidate set the second similarity of combination A is q and that of combination B is n. Each first similarity is multiplied by a weight a and each second similarity by a weight b (b = 1 - a), so the first weight similarity set contains ap and am, the second weight similarity set contains bq and bn, and the Chinese character combination score result set contains ap + bq and am + bn.
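The weighted scoring in this example can be sketched as follows. The function name and the numeric values chosen for p, m, q, n and the weight a are hypothetical, and the sketch assumes one shared weight per candidate set with b = 1 - a, as in the example.

```python
from typing import Dict


def score_candidates(
    first: Dict[str, float],
    second: Dict[str, float],
    weight_first: float,
) -> Dict[str, float]:
    """Combine first and second similarities per combination as a*first + b*second."""
    if not 0.0 <= weight_first <= 1.0:
        raise ValueError("weight_first must lie in [0, 1]")
    if first.keys() != second.keys():
        raise ValueError("both candidate sets must contain the same combinations")
    weight_second = 1.0 - weight_first  # the two weights sum to 1
    return {
        combo: weight_first * first[combo] + weight_second * second[combo]
        for combo in first
    }


# Hypothetical values: p = 0.8, m = 0.6 (first level); q = 0.5, n = 0.9 (second level).
first_level = {"A": 0.8, "B": 0.6}
second_level = {"A": 0.5, "B": 0.9}
scores = score_candidates(first_level, second_level, weight_first=0.6)
```

With these values combination B ends up with the higher score even though A has the higher first similarity, which shows how the cluster-level similarity can change the ranking.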
The determining module 108 is configured to determine that the Chinese character combination corresponding to the highest score in the Chinese character combination score result set is the Chinese character combination of the organization name.
In this embodiment, after the Chinese character combination with the highest score in the Chinese character combination score result set is determined to be the Chinese character combination of the organization name, it may further be determined whether the text to be recognized extracted by the text recognition module 101 is identical to that Chinese character combination. If so, the text to be recognized is output as the organization name; if not, the text to be recognized is replaced with the Chinese character combination, which is then output.
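The selection and replacement logic above can be sketched as follows. The function name and the returned "replaced" flag are hypothetical conveniences, and resolving ties by taking the first highest-scoring combination is an assumption of this sketch.

```python
from typing import Dict, Tuple


def resolve_organization_name(
    recognized_text: str,
    scores: Dict[str, float],
) -> Tuple[str, bool]:
    """Return the highest-scoring combination and whether it replaces the recognized text."""
    if not scores:
        raise ValueError("the score result set must not be empty")
    # max() keeps the first combination encountered when several share the top score.
    best_combo = max(scores, key=lambda combo: scores[combo])
    replaced = recognized_text != best_combo
    return best_combo, replaced


# Hypothetical score result set for two candidate combinations.
score_result_set = {"A": 0.68, "B": 0.72}
kept = resolve_organization_name("B", score_result_set)
swapped = resolve_organization_name("A", score_result_set)
```

In both calls the output name is the top-scoring combination; the flag only records whether the speech recognition result had to be corrected.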
In the embodiment of the invention, a text to be identified of an organization name is obtained, and an average word vector of Chinese character combinations corresponding to pinyin of the text to be identified is calculated; obtaining a target industry class to which an organization name belongs and a target cluster class to which the target industry class belongs; calculating the first similarity between the average character vector of each Chinese character combination in the Chinese character combinations and the average industry vector of the target industry class; and calculating the second similarity of the average word vector of each Chinese character combination in the Chinese character combinations and the average cluster vector of the target cluster, carrying out weight calculation on the first similarities and the second similarities, and determining the Chinese character combination with the highest weight calculation score as the Chinese character combination of the organization name. The most accurate word combination can be selected from a plurality of possible word combinations of the organization, thereby achieving the purpose of improving the accuracy of identifying the organization names existing in the voice information.
Fig. 3 is a schematic structural diagram of an electronic device implementing the feature information identification method according to the present invention.
The electronic device 1 may comprise a processor 10, a memory 11 and a bus, and may further comprise a computer program stored in the memory 11 and executable on the processor 10.
The memory 11 includes at least one type of readable storage medium, including flash memory, a mobile hard disk, a multimedia card, a card memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device 1, such as a removable hard disk of the electronic device 1. The memory 11 may in other embodiments also be an external storage device of the electronic device 1, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the electronic device 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only for storing application software installed in the electronic device 1 and various types of data, such as a code for identification of characteristic information, etc., but also for temporarily storing data that has been output or is to be output.
The processor 10 may be comprised of integrated circuits in some embodiments, for example, a single packaged integrated circuit, or may be comprised of multiple integrated circuits packaged with the same or different functions, including one or more central processing units (Central Processing unit, CPU), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like. The processor 10 is a Control Unit (Control Unit) of the electronic device, connects respective parts of the entire electronic device using various interfaces and lines, and executes various functions of the electronic device 1 and processes data by running or executing programs or modules (e.g., identification programs of feature information, etc.) stored in the memory 11, and calling data stored in the memory 11.
The bus may be a peripheral component interconnect standard (peripheral component interconnect, PCI) bus or an extended industry standard architecture (extended industry standard architecture, EISA) bus, among others. The bus may be classified as an address bus, a data bus, a control bus, etc. The bus is arranged to enable a connection communication between the memory 11 and at least one processor 10 etc.
Fig. 3 shows only an electronic device with components, it being understood by a person skilled in the art that the structure shown in fig. 3 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than shown, or may combine certain components, or may be arranged in different components.
For example, although not shown, the electronic device 1 may further include a power source (such as a battery) for supplying power to each component, and preferably, the power source may be logically connected to the at least one processor 10 through a power management device, so that functions of charge management, discharge management, power consumption management, and the like are implemented through the power management device. The power supply may also include one or more of any of a direct current or alternating current power supply, recharging device, power failure detection circuit, power converter or inverter, power status indicator, etc. The electronic device 1 may further include various sensors, bluetooth modules, wi-Fi modules, etc., which will not be described herein.
Further, the electronic device 1 may also comprise a network interface, optionally the network interface may comprise a wired interface and/or a wireless interface (e.g. WI-FI interface, bluetooth interface, etc.), typically used for establishing a communication connection between the electronic device 1 and other electronic devices.
The electronic device 1 may optionally further comprise a user interface, which may be a display (Display) or an input unit such as a keyboard (Keyboard), and may optionally also be a standard wired interface or a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display may also be referred to as a display screen or display unit, and is used for displaying information processed in the electronic device 1 and for displaying a visual user interface.
It should be understood that the embodiments described are for illustrative purposes only and are not limited to this configuration in the scope of the patent application.
The identification program 12 of the characteristic information stored in the memory 11 in the electronic device 1 is a combination of a plurality of instructions, which when executed in the processor 10, can implement:
receiving a text data set obtained through voice recognition, and extracting texts to be recognized of organization names from the text data set by using a named entity recognition technology;
summarizing the Chinese character combinations corresponding to the pinyin of the text to be identified to obtain a Chinese character combination candidate set, and carrying out vector calculation on at least two Chinese character combinations in the Chinese character combination candidate set to obtain average word vectors of the at least two Chinese character combinations;
Obtaining a target industry category to which the organization name belongs, obtaining an industry category set containing the target industry category, and performing vector calculation on at least two industry categories containing the target industry category in the industry category set to obtain an average industry vector containing the at least two industry categories;
calculating first similarity of an average word vector of each Chinese character combination in the at least two Chinese character combinations and an average industry vector of the target industry class to obtain a first-level Chinese character combination candidate set, wherein the first-level Chinese character combination candidate set comprises the at least two Chinese character combinations and the first similarity respectively corresponding to the at least two Chinese character combinations;
performing cluster calculation on the industry class set to obtain an industry class cluster set, acquiring a target cluster class to which the target industry class belongs from the industry class cluster set, and calculating an average cluster class vector of the target cluster class;
calculating the second similarity of the average word vector of each Chinese character combination in the at least two Chinese character combinations and the average cluster vector of the target cluster to obtain a second-level Chinese character combination candidate set, wherein the second-level Chinese character combination candidate set comprises the at least two Chinese character combinations and the second similarity respectively corresponding to the at least two Chinese character combinations;
Performing weight calculation on the first similarity contained in the first-level Chinese character combination candidate set and the second similarity contained in the second-level Chinese character combination candidate set to obtain a Chinese character combination score result set;
and determining the Chinese character combination corresponding to the highest score in the Chinese character combination scoring result set as the Chinese character combination of the organization name.
Specifically, the specific implementation method of the above instructions by the processor 10 may refer to the description of the relevant steps in the corresponding embodiment of fig. 1, which is not repeated herein.
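For orientation, the scoring part of the instruction sequence above can be sketched end to end. This is a simplified illustration, not the patented program: it assumes that named entity recognition, pinyin expansion, industry classification and clustering have already produced the candidate combinations, the average industry vector and the average cluster vector. The character vectors, the candidate strings and the function names are all hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise arithmetic mean of equally sized vectors."""
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(len(vectors[0]))]


def cosine(x: Sequence[float], y: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 for a zero vector (assumption of this sketch)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def pick_organization_name(
    candidates: Sequence[str],
    char_vectors: Dict[str, Sequence[float]],
    industry_vector: Sequence[float],
    cluster_vector: Sequence[float],
    weight_first: float = 0.5,
) -> Tuple[str, Dict[str, float]]:
    """Score every candidate combination and return the best one with all scores."""
    scores: Dict[str, float] = {}
    for combo in candidates:
        # Average word vector: mean of the vectors of the characters in the combination.
        avg = mean_vector([char_vectors[ch] for ch in combo])
        first = cosine(avg, industry_vector)   # first similarity (industry level)
        second = cosine(avg, cluster_vector)   # second similarity (cluster level)
        scores[combo] = weight_first * first + (1.0 - weight_first) * second
    best = max(scores, key=lambda c: scores[c])
    return best, scores


# Toy data (hypothetical): two homophone candidates and made-up 2-dimensional vectors.
char_vectors = {"银": [1.0, 0.0], "行": [1.0, 0.2], "音": [0.0, 1.0]}
best, all_scores = pick_organization_name(
    candidates=["银行", "音行"],
    char_vectors=char_vectors,
    industry_vector=[1.0, 0.1],
    cluster_vector=[1.0, 0.3],
)
```

With these toy vectors the candidate whose characters lie closer to the industry and cluster vectors wins; real use would take the character vectors from a pre-trained word vector dictionary.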
Further, the modules/units integrated in the electronic device 1 may be stored in a computer readable storage medium if implemented in the form of software functional units and sold or used as separate products. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM).
In the several embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be other manners of division when actually implemented.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical units, may be located in one place, or may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of the present invention may be integrated in one processing unit, each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof.
The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude a plurality. A plurality of units or means recited in the system claims can also be implemented by one unit or means through software or hardware. The terms first, second, etc. are used to denote names and do not denote any particular order.
Finally, it should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made to the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention.

Claims (8)

1. A feature information identification method, characterized in that it is applied to an electronic device and comprises the following steps:
receiving a text data set obtained through voice recognition, and extracting texts to be recognized of organization names from the text data set by using a named entity recognition technology;
summarizing the Chinese character combinations corresponding to the pinyin of the text to be identified to obtain a Chinese character combination candidate set, and carrying out vector calculation on at least two Chinese character combinations in the Chinese character combination candidate set to obtain average word vectors of the at least two Chinese character combinations; obtaining a target industry category to which the organization name belongs, obtaining an industry category set containing the target industry category, and performing vector calculation on at least two industry categories containing the target industry category in the industry category set to obtain an average industry vector containing the at least two industry categories;
Calculating first similarity of an average word vector of each Chinese character combination in the at least two Chinese character combinations and an average industry vector of the target industry class to obtain a first-level Chinese character combination candidate set, wherein the first-level Chinese character combination candidate set comprises the at least two Chinese character combinations and the first similarity respectively corresponding to the at least two Chinese character combinations;
performing cluster calculation on the industry class set to obtain an industry class cluster set, acquiring a target cluster class to which the target industry class belongs from the industry class cluster set, and calculating an average cluster class vector of the target cluster class;
calculating the second similarity of the average word vector of each Chinese character combination in the at least two Chinese character combinations and the average cluster vector of the target cluster to obtain a second-level Chinese character combination candidate set, wherein the second-level Chinese character combination candidate set comprises the at least two Chinese character combinations and the second similarity respectively corresponding to the at least two Chinese character combinations;
performing weight calculation on the first similarity contained in the first-level Chinese character combination candidate set and the second similarity contained in the second-level Chinese character combination candidate set to obtain a Chinese character combination score result set;
Determining the Chinese character combination corresponding to the highest score in the Chinese character combination score result set as the Chinese character combination of the organization name; the cluster calculation is performed on the industry class set to obtain an industry class cluster set, which comprises the following steps: calculating the distance between any two industry categories in the industry category set, and combining the two industry categories with the minimum distance to obtain a cluster; circularly calculating the distance between any two industry categories in the non-combined industry categories, combining the two industry categories with the smallest distance to obtain new cluster categories until the number of the cluster categories reaches a preset number, and determining that all the obtained cluster categories form an industry category cluster set;
the step of performing weight calculation on the first similarity included in the first-level Chinese character combination candidate set and the second similarity included in the second-level Chinese character combination candidate set to obtain a Chinese character combination score result set includes: multiplying a plurality of first similarities contained in the first-level Chinese character combination candidate set and a plurality of second similarities contained in the second-level Chinese character combination candidate set by the same or different weights respectively to obtain a first weight similarity set and a second weight similarity set; and correspondingly adding a plurality of first weight similarities contained in the first weight similarity set and a plurality of second weight similarities contained in the second weight similarity set respectively to obtain a Chinese character combination scoring result set.
2. The method for identifying feature information according to claim 1, wherein the vector calculation is performed on at least two chinese character combinations in the candidate set of chinese character combinations to obtain an average word vector of the at least two chinese character combinations, including: acquiring the word vector of each Chinese character in at least two Chinese character combinations contained in the Chinese character combination candidate set by utilizing a pre-trained word vector dictionary;
and calculating the average value of the word vectors of all Chinese characters contained in each Chinese character combination in at least two Chinese character combinations according to the word vector of each Chinese character in the at least two Chinese character combinations to obtain the average word vector of the at least two Chinese character combinations.
3. The method for identifying feature information according to claim 1, wherein the obtaining a target industry category to which the organization name belongs, and obtaining an industry category set including the target industry category, includes:
forward and backward encoding is carried out on the text data set through a bidirectional LSTM network based on an attention mechanism, and vectors generated by the forward encoding and the backward encoding are spliced together to form a spliced vector;
inputting the spliced vector into a pre-constructed first industry classification model, and determining that the industry class output by the first industry classification model is the target industry class to which the organization name belongs;
Classifying the Chinese character combination candidate set by using a pre-constructed second industry classification model to obtain a classification result, wherein the classification result comprises industry categories corresponding to Chinese character combinations contained in the Chinese character combination candidate set;
and combining different industry categories in the classification result with the target industry category to obtain an industry category set containing the target industry category.
4. The method for identifying feature information according to claim 1, wherein the obtaining a target industry category to which the organization name belongs, and obtaining an industry category set including the target industry category, includes:
acquiring a supplementary Chinese character combination candidate set, wherein the supplementary Chinese character combination candidate set comprises a supplementary organization name;
classifying the organization names by using a pre-constructed third industry classification model to obtain target industry categories to which the organization names belong;
classifying the complementary Chinese character combination candidate set by using the third industry classification model to obtain a classification result, wherein the classification result comprises industry categories corresponding to the names of complementary organization mechanisms contained in the complementary Chinese character combination candidate set;
And combining different industry categories in the classification result with the target industry category to obtain an industry category set containing the target industry category.
5. The method for identifying feature information according to any one of claims 1 to 4, wherein the calculating a first similarity between the average word vector of each of the at least two Chinese character combinations and the average industry vector of the target industry category comprises:
calculating a first similarity between an average word vector of each Chinese character combination of the at least two Chinese character combinations and an average industry vector of the target industry category through a similarity calculation function, wherein the similarity calculation function is as follows:
wherein sim(x_i, y_i) represents the first similarity, x_i represents the average word vector of a Chinese character combination, y_i represents the average industry vector of the target industry category, and n represents the vector dimension of the average word vector or the average industry vector.
6. A feature information identification apparatus for implementing the feature information identification method according to any one of claims 1 to 5, characterized in that the apparatus comprises:
the text recognition module is used for receiving a text data set obtained through voice recognition and extracting texts to be recognized of organization names from the text data set by using a named entity recognition technology;
The word vector calculation module is used for summarizing the Chinese character combinations corresponding to the pinyin of the text to be identified to obtain a Chinese character combination candidate set, and carrying out vector calculation on at least two Chinese character combinations in the Chinese character combination candidate set to obtain an average word vector of the at least two Chinese character combinations;
the industry vector calculation module is used for acquiring a target industry category to which the organization name belongs, acquiring an industry category set containing the target industry category, and carrying out vector calculation on at least two industry categories containing the target industry category in the industry category set to obtain an average industry vector containing the at least two industry categories;
the first similarity calculation module is used for calculating first similarity between an average word vector of each Chinese character combination in the at least two Chinese character combinations and an average industry vector of the target industry class to obtain a first-level Chinese character combination candidate set, wherein the first-level Chinese character combination candidate set comprises the at least two Chinese character combinations and first similarity respectively corresponding to the at least two Chinese character combinations;
the cluster vector calculation module is used for carrying out cluster calculation on the industry class set to obtain an industry class cluster set, acquiring a target cluster class to which the target industry class belongs from the industry class cluster set, and calculating an average cluster vector of the target cluster class;
the second similarity calculation module is used for calculating a second similarity between an average word vector of each Chinese character combination in the at least two Chinese character combinations and an average cluster vector of the target cluster to obtain a second-level Chinese character combination candidate set, wherein the second-level Chinese character combination candidate set comprises the at least two Chinese character combinations and second similarities respectively corresponding to the at least two Chinese character combinations;
the weight calculation module is used for carrying out weight calculation on the first similarity contained in the first-level Chinese character combination candidate set and the second similarity contained in the second-level Chinese character combination candidate set to obtain a Chinese character combination score result set; and the determining module is used for determining the Chinese character combination corresponding to the highest score in the Chinese character combination score result set as the Chinese character combination of the organization name.
7. An electronic device, the electronic device comprising:
at least one processor; and,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of identifying characteristic information according to any one of claims 1 to 5.
8. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method of identifying characteristic information according to any one of claims 1 to 5.
CN202010482841.6A 2020-05-29 2020-05-29 Feature information identification method and device and computer readable storage medium Active CN111680513B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010482841.6A CN111680513B (en) 2020-05-29 2020-05-29 Feature information identification method and device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN111680513A (en) 2020-09-18
CN111680513B (en) 2024-03-29

Family

ID=72452963

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010482841.6A Active CN111680513B (en) 2020-05-29 2020-05-29 Feature information identification method and device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111680513B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016027493A (en) * 2015-09-29 2016-02-18 株式会社東芝 Document classification support device, document classification support method, and document classification support program
WO2018153295A1 (en) * 2017-02-27 2018-08-30 腾讯科技(深圳)有限公司 Text entity extraction method, device, apparatus, and storage media
CN109165384A (en) * 2018-08-23 2019-01-08 成都四方伟业软件股份有限公司 A kind of name entity recognition method and device
CN109346056A (en) * 2018-09-20 2019-02-15 中国科学院自动化研究所 Phoneme synthesizing method and device based on depth measure network
CN110674813A (en) * 2019-09-24 2020-01-10 北京字节跳动网络技术有限公司 Chinese character recognition method and device, computer readable medium and electronic equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11010561B2 (en) * 2018-09-27 2021-05-18 Apple Inc. Sentiment prediction from textual data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
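A Chinese short-text keyword extraction model based on the attention mechanism; Yang Danhao; Wu Yuexin; Fan Chunxiao; Computer Science (计算机科学); 20200115 (No. 01); full text *
A network security named entity recognition method based on component CNN; Wei Xiao; Qin Yongbin; Chen Yanping; Computer and Digital Engineering (计算机与数字工程); 20200120 (No. 01); full text *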

Also Published As

Publication number Publication date
CN111680513A (en) 2020-09-18

Similar Documents

Publication Publication Date Title
CN112597312A (en) Text classification method and device, electronic equipment and readable storage medium
CN112560453B (en) Voice information verification method and device, electronic equipment and medium
CN112883190A (en) Text classification method and device, electronic equipment and storage medium
CN113033198B (en) Similar text pushing method and device, electronic equipment and computer storage medium
CN115392237B (en) Emotion analysis model training method, device, equipment and storage medium
CN113157927A (en) Text classification method and device, electronic equipment and readable storage medium
CN113268615A (en) Resource label generation method and device, electronic equipment and storage medium
CN113378970A (en) Sentence similarity detection method and device, electronic equipment and storage medium
CN113722483A (en) Topic classification method, device, equipment and storage medium
CN114398557A (en) Information recommendation method and device based on double portraits, electronic equipment and storage medium
CN113821622A (en) Answer retrieval method and device based on artificial intelligence, electronic equipment and medium
CN113344125B (en) Long text matching recognition method and device, electronic equipment and storage medium
CN113360654B (en) Text classification method, apparatus, electronic device and readable storage medium
CN114840684A (en) Map construction method, device and equipment based on medical entity and storage medium
CN113254814A (en) Network course video labeling method and device, electronic equipment and medium
CN116450829A (en) Medical text classification method, device, equipment and medium
CN116340516A (en) Entity relation cluster extraction method, device, equipment and storage medium
CN116341646A (en) Pretraining method and device of Bert model, electronic equipment and storage medium
CN113705201B (en) Text-based event probability prediction evaluation algorithm, electronic device and storage medium
CN114943306A (en) Intention classification method, device, equipment and storage medium
CN111680513B (en) Feature information identification method and device and computer readable storage medium
CN114219367A (en) User scoring method, device, equipment and storage medium
CN114708073A (en) Intelligent detection method and device for surrounding mark and serial mark, electronic equipment and storage medium
CN114595321A (en) Question marking method and device, electronic equipment and storage medium
CN113706207A (en) Order transaction rate analysis method, device, equipment and medium based on semantic analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant