CN115687606A - Corpus processing method and device, electronic equipment and storage medium - Google Patents

Corpus processing method and device, electronic equipment and storage medium

Info

Publication number
CN115687606A
Authority
CN
China
Prior art keywords
matrix
corpus
vectors
vector
texts
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110835844.8A
Other languages
Chinese (zh)
Inventor
赵艺宾
雷昕
闫凡
徐敬蘅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sangfor Technologies Co Ltd
Original Assignee
Sangfor Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sangfor Technologies Co Ltd filed Critical Sangfor Technologies Co Ltd
Priority to CN202110835844.8A priority Critical patent/CN115687606A/en
Publication of CN115687606A publication Critical patent/CN115687606A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Database Structures and File System Structures Therefor (AREA)

Abstract

The application discloses a corpus processing method and apparatus, an electronic device, and a storage medium. The corpus processing method includes: generating a first matrix based on corpus information, where each row element of the first matrix represents a first text in the corpus information; dividing each row element of the first matrix into first vectors of a set dimension; clustering the first vectors based on the similarity between them to obtain at least one cluster; replacing the first vectors in each cluster with the cluster center corresponding to that cluster, obtained by the clustering, to obtain a second matrix; and inputting the second matrix into a set Natural Language Processing (NLP) model to obtain a semantic recognition result for the corpus information.

Description

Corpus processing method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a corpus processing method and apparatus, an electronic device, and a storage medium.
Background
In the field of Natural Language Processing (NLP) for machine learning, word vectors are a common representation of word meaning, and the dimensions of a word vector represent the features of a word. An NLP model is trained on corpus information: the corpus is processed into word vectors, and the word vectors are input into the model for training. When the NLP model is trained on word vectors, the larger the dimensionality of the input word vectors, the more accurately the resulting model can distinguish different words, but the more memory the model occupies when loaded.
In the related art, in order to reduce the memory occupied by loading the model, word vector dimension reduction is usually performed during corpus processing using Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), or Embedding based on a deep network; these approaches consume substantial computing power and are slow.
Disclosure of Invention
In view of this, embodiments of the present application provide a corpus processing method and apparatus, an electronic device, and a storage medium, so as to at least solve the problems of high computational cost and low processing speed in corpus processing in the related art.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a corpus processing method, which comprises the following steps:
generating a first matrix based on the corpus information; each row element of the first matrix represents a first text in the corpus information;
dividing each row element of the first matrix into a first vector with a set dimension;
clustering the first vectors based on the similarity between the first vectors to obtain at least one cluster;
replacing the first vector in the corresponding cluster by using the cluster center corresponding to each cluster obtained by clustering to obtain a second matrix;
and inputting the second matrix into a set NLP model to obtain a semantic recognition result about the corpus information.
In the foregoing scheme, the clustering the first vectors based on the similarity between the first vectors includes:
distributing the first vectors with the same columns occupied in the first matrix to the same first set to obtain at least one first set;
and clustering the first vectors in each first set of the at least one first set according to the similarity among the first vectors.
In the foregoing solution, the generating a first matrix based on the corpus information includes:
determining at least two first texts based on a splitting result obtained by splitting the corpus information;
and performing feature extraction on each of the determined at least two first texts to generate the first matrix.
In the foregoing solution, the determining at least two first texts based on a splitting result obtained by splitting the corpus information includes:
splitting the corpus information to obtain at least two second texts;
counting the occurrence frequency of each second text in the corpus information, and determining a counting result;
and determining at least two first texts in the at least two second texts based on the determined statistical results.
In the foregoing solution, the determining, based on the determined statistical result, at least two first texts in the at least two second texts includes:
determining the first text from at least two second texts, the number of times of occurrence of which in the corpus information meets a set threshold, based on the determined statistical result;
and/or,
determining the first text in the at least two second texts by using an Inverse Document Frequency (IDF) algorithm based on the determined statistical result.
In the foregoing solution, the determining at least two first texts based on a splitting result obtained by splitting the corpus information includes:
determining at least two first texts based on a splitting result obtained by splitting the corpus information according to a set word segmentation rule; the first text represents a word determined based on a splitting result of the corpus information.
In the foregoing solution, the clustering the first vectors based on the similarity between the first vectors includes:
obtaining a similarity measurement matrix between the first vectors according to the similarity between the elements of the first vectors;
and clustering each first vector according to the similarity metric matrix.
An embodiment of the present application further provides a corpus processing apparatus, including:
the generating unit is used for generating a first matrix based on the corpus information; each row element of the first matrix represents a first text in the corpus information;
the dividing unit is used for dividing each row element of the first matrix into a first vector with a set dimension;
the clustering unit is used for clustering the first vectors based on the similarity among the first vectors to obtain at least one cluster;
the replacing unit is used for replacing the first vector in the corresponding cluster by using the cluster center corresponding to each cluster obtained by clustering to obtain a second matrix;
and the recognition unit is used for inputting the second matrix into a set NLP model to obtain a semantic recognition result related to the corpus information.
An embodiment of the present application further provides an electronic device, including: a processor and a memory for storing a computer program capable of running on the processor,
and the processor is used for executing the steps of the corpus processing method when the computer program is run.
The embodiment of the present application further provides a storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the steps of the corpus processing method are implemented.
In the embodiment of the application, a first matrix is generated based on corpus information, where each row element of the first matrix represents a first text in the corpus information. Each row element of the first matrix is divided into first vectors of a set dimension, the first vectors are clustered based on the similarity between them to obtain at least one cluster, and the cluster center corresponding to each cluster replaces the first vectors in that cluster to obtain a second matrix. The second matrix is then input into a set NLP model to obtain a semantic recognition result for the corpus information. In this way, the space occupied by the word vectors is reduced by clustering during corpus processing, and the corpus compression process requires no large high-dimensional matrix multiplications, so the computing power consumed by corpus processing can be reduced and the corpus processing speed increased.
Drawings
Fig. 1 is a schematic flow chart illustrating a corpus processing method according to an embodiment of the present application;
fig. 2 is a schematic diagram of a first matrix according to an embodiment of the present application;
FIG. 3 is a diagram illustrating a corpus processing method according to an embodiment of the present application;
FIG. 4 is a diagram illustrating another corpus processing method according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a corpus processing apparatus according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In the field of NLP for machine learning, word vectors are a common representation of word meaning, and the dimensions of a word vector represent the features of a word. An NLP model is trained on corpus information: the corpus is processed into word vectors, and the word vectors are input into the model for training. When the NLP model is trained on word vectors, the larger the dimensionality of the input word vectors, the more accurately the resulting model can distinguish different words, but the more memory the model occupies when loaded.
In the related art, in order to reduce the memory occupied by loading the model, word vector dimension reduction is usually performed during corpus processing using PCA, LDA, or deep-network-based Embedding; these require large numbers of high-dimensional matrix multiplications and therefore consume substantial computing power and are slow.
Based on this, in various embodiments of the present application, a first matrix is generated based on corpus information, each row element of the first matrix is divided into first vectors of a set dimension, the first vectors are clustered based on the similarity between them to obtain at least one cluster, the cluster center corresponding to each cluster replaces the first vectors in that cluster to obtain a second matrix, and the second matrix is input into a set NLP model to obtain a semantic recognition result for the corpus information; each row element of the first matrix represents a first text in the corpus information. Because the space occupied by the word vectors is reduced by clustering during corpus processing, and the corpus compression process requires no large high-dimensional matrix multiplications, the computing power consumed by corpus processing can be reduced and the corpus processing speed increased.
In order to make the objects, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the application and not restrictive of it.
Fig. 1 is a schematic flowchart of a corpus processing method according to an embodiment of the present application; the execution subject may be an electronic device. As shown in Fig. 1, the corpus processing method includes:
step 101: a first matrix is generated based on the corpus information.
Each row element of the first matrix represents a first text in the corpus information.
In this embodiment, the first matrix is generated based on the corpus information. Here, the corpus information represents text for semantic recognition, that is, text whose semantics are to be recognized. In practical applications, the corpus information can be processed into corresponding word, sentence, and/or paragraph texts according to the specific recognition scenario. Each row element of the first matrix represents a first text derived from the corpus information; the first text may be a word, a sentence, or a paragraph, so that the vector processing applies equally to word vectors, sentence vectors, or paragraph vectors corresponding to the corpus information.
Step 102: each row element of the first matrix is divided into a first vector of set dimensions.
Each row of the first matrix is divided into first vectors containing a set number of elements. Starting from the first element of each row, every p adjacent elements may be taken as one p-dimensional first vector, so that each row of the first matrix is divided into m first vectors of p dimensions. Here, p represents the set dimension of the first vector, that is, the set number; m and p are positive integers not less than 1, and the combination of m and p may be chosen according to the dimensions of the first matrix. The number m of first vectors may be adjusted by feedback according to how well the corpus processing result performs in the model and how much memory the loaded model occupies; for example, if the accuracy of the semantic recognition result obtained by inputting the corpus processing result into the NLP model falls below a set threshold, the value of m is increased accordingly.
For example, if the first matrix is an n × 12 matrix in which each row has 12 elements, the combination of m and p may be set to one of 1 × 12, 2 × 6, 3 × 4, 4 × 3, 6 × 2, and 12 × 1.
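As a minimal sketch of this division step (the function name, the matrix values, and the use of NumPy are illustrative assumptions, not part of the original disclosure):

```python
import numpy as np

def divide_rows(first_matrix: np.ndarray, p: int) -> np.ndarray:
    """Divide each row of the first matrix into m first vectors of p dimensions.

    Requires the row length to be divisible by p; m is inferred as cols // p.
    Returns an array of shape (rows, m, p).
    """
    rows, cols = first_matrix.shape
    if cols % p != 0:
        raise ValueError("row length must be divisible by the set dimension p")
    m = cols // p
    return first_matrix.reshape(rows, m, p)

# A 12-column first matrix admits any combination m x p from
# {1x12, 2x6, 3x4, 4x3, 6x2, 12x1}, as described above.
first_matrix = np.random.rand(5, 12)   # hypothetical 5 x 12 first matrix
first_vectors = divide_rows(first_matrix, p=4)
print(first_vectors.shape)             # (5, 3, 4): 3 first vectors of 4 dims per row
```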
Step 103: and clustering the first vectors based on the similarity between the first vectors to obtain at least one cluster.
The first vectors are clustered based on the similarity between them to obtain at least one cluster. Here, a clustering algorithm such as k-means may be used, driven by the similarity of the first vectors. In the clustering result, the similarity between first vectors in the same cluster may be higher than a similarity threshold. The number of clusters may be adjusted by feedback according to the performance of the corpus processing result in the model and the memory occupied by model loading.
For example, if first vector A is (0, 0, 0, 0, 0.80, 0), first vector B is (0, 0, 0, 0, 0.82, 0), first vector C is (0.32, 0, 0, 0, 0, 0), and first vector D is (0.30, 0, 0, 0, 0, 0), then A and B are clustered into one cluster, and C and D into another.
Step 104: and replacing the first vector in the corresponding cluster by using the cluster center corresponding to each cluster obtained by clustering to obtain a second matrix.
The cluster center corresponding to each cluster obtained by clustering replaces the first vectors in that cluster in the first matrix, yielding the second matrix. Here, the cluster center may take the form of a single element determined from all first vectors in the cluster, or the form of a vector, for example the mean of all first vectors in the cluster; the vector dimension of the cluster center may also be adjusted by feedback according to the performance of the corpus processing result in the model and the memory occupied by model loading.
For example, each row element of the first matrix [matrix image not reproduced] is divided into one 4-dimensional first vector. After clustering, the first vector E (0, 0.80, 0) and the first vector F (0, 0.82, 0) form one cluster whose cluster center is (0, 0.81, 0); the first vector G (0, 0.32) and the first vector H (0, 0.30) form another cluster whose cluster center is (0, 0.31). Replacing the first vectors in each cluster of the first matrix with the corresponding cluster center yields the second matrix [matrix image not reproduced].
As another example, each row element of the first matrix [matrix image not reproduced] is again divided into one 4-dimensional first vector. After clustering, the first vector E (0, 0.80, 0) and the first vector F (0, 0.82, 0) form one cluster whose cluster center is the element 0.70; the first vector G (0, 0.22) and the first vector H (0, 0.20) form another cluster whose cluster center is the element 0.80. Replacing the first vectors in each cluster of the first matrix with the corresponding cluster centers yields the second matrix [matrix image not reproduced].
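The vector-form example above can be reproduced end to end with a short sketch. This is an illustration only: k-means from scikit-learn stands in for the unspecified clustering algorithm, and the exact element positions of the first vectors E-H are assumed, since the original matrices are given only as figures.

```python
import numpy as np
from sklearn.cluster import KMeans

# First matrix: each 4-element row is one first vector
# (element positions of E-H assumed for illustration).
first_matrix = np.array([
    [0.0, 0.0, 0.80, 0.0],   # first vector E
    [0.0, 0.0, 0.82, 0.0],   # first vector F
    [0.0, 0.0, 0.0, 0.32],   # first vector G
    [0.0, 0.0, 0.0, 0.30],   # first vector H
])

# Cluster the first vectors; two clusters are expected: {E, F} and {G, H}.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(first_matrix)

# Replace each first vector with its cluster center to obtain the second matrix.
second_matrix = km.cluster_centers_[km.labels_]
print(second_matrix)
# Rows for E and F both become (0, 0, 0.81, 0);
# rows for G and H both become (0, 0, 0, 0.31).
```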
Step 105: and inputting the second matrix into a set NLP model to obtain a semantic recognition result about the corpus information.
The second matrix corresponding to the corpus information is input into the set NLP model to obtain a semantic recognition result for the corpus information. Depending on the set NLP model, functions such as semantic analysis and emotion classification can be realized. Using the processed second matrix reduces the memory occupied when loading NLP models with different functions.
In the embodiment of the application, the first vectors of the first matrix representing the corpus information are clustered, the first vectors in each cluster are replaced by the cluster center corresponding to that cluster to obtain a second matrix, and the second matrix is input into a set NLP model to obtain a semantic recognition result for the corpus information. The space occupied by the word vectors is thus reduced by clustering during corpus processing, and the corpus compression process requires no large high-dimensional matrix multiplications, so the computing power consumed by corpus processing can be reduced and the corpus processing speed increased.
In an embodiment, the clustering the first vectors based on the similarity between the first vectors includes:
distributing the first vectors with the same columns occupied in the first matrix to the same first set to obtain at least one first set;
and clustering the first vectors in each first set of the at least one first set according to the similarity among the first vectors.
According to the columns each first vector occupies in the first matrix, first vectors occupying the same columns are assigned to the same first set, yielding at least one first set; the first vectors within each of the at least one first set are then clustered according to the similarity between them.
For example, as shown in the first matrix diagram of fig. 2, the first vectors occupying the first column to the fourth column of the first matrix are allocated to the first set a, the first vectors occupying the fifth column to the eighth column of the first matrix are allocated to the first set B, the first vectors occupying the ninth column to the twelfth column of the first matrix are allocated to the first set C, and the first vectors occupying the thirteenth column to the sixteenth column of the first matrix are allocated to the first set D.
Assigning first vectors that occupy the same columns to the same first set and clustering each first set separately exploits the fact that elements in the same columns of the first matrix usually represent one type of feature of the first texts. Clustering first vectors that represent the same features makes the meaning represented by the resulting second matrix closer to that of the first matrix, improving the accuracy of the model's semantic recognition while compressing the memory the model needs for the corpus information.
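A sketch of this per-first-set clustering follows, assuming each first set holds the first vectors drawn from one contiguous block of p columns; the function and variable names are illustrative, and k-means is used as one possible clustering algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_by_column_block(first_matrix: np.ndarray, p: int, n_clusters: int):
    """Cluster the first vectors of each column block (first set) separately
    and rebuild the second matrix from the per-block cluster centers."""
    rows, cols = first_matrix.shape
    second_matrix = np.empty_like(first_matrix)
    for start in range(0, cols, p):
        first_set = first_matrix[:, start:start + p]   # vectors sharing columns
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(first_set)
        second_matrix[:, start:start + p] = km.cluster_centers_[km.labels_]
    return second_matrix

# e.g. a 16-column first matrix split into first sets A-D of 4 columns each
second = cluster_by_column_block(np.random.rand(100, 16), p=4, n_clusters=8)
```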
In an embodiment, the generating a first matrix based on the corpus information includes:
determining at least two first texts based on a splitting result obtained by splitting the corpus information;
and performing feature extraction on each of the determined at least two first texts to generate the first matrix.
The corpus information is split to obtain a splitting result, at least two first texts are determined based on the splitting result, feature extraction is performed on each of the determined first texts to obtain a feature vector corresponding to each first text, and the first matrix is generated from the elements of these feature vectors. The first text represents text in the corpus information that is highly relevant to semantic recognition and serves as model input in the corpus processing result. The at least two first texts may form a list determined from the corpus information, such as a word list or a sentence list. Feature extraction for each first text may use a machine learning model of the word2vec algorithm, such as a Continuous Bag of Words (CBOW) model or a Skip-Gram model. The number of first texts can be adjusted by feedback according to the performance of the corpus processing result in the model and the memory occupied by model loading.
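A minimal sketch of this feature-extraction step using the gensim library's Word2Vec implementation is shown below; the toy corpus, the vector size, and the vocabulary are assumptions for illustration.

```python
import numpy as np
from gensim.models import Word2Vec

# Hypothetical split corpus: each inner list is one sequence of first texts.
split_corpus = [["cat", "head", "round"], ["cat", "face", "short"]]

# sg=0 selects the CBOW model; sg=1 would select Skip-Gram.
model = Word2Vec(sentences=split_corpus, vector_size=8, min_count=1, sg=0)

# Stack the feature vector of each first text into the first matrix,
# one row per first text.
vocabulary = ["cat", "head", "round", "face", "short"]
first_matrix = np.stack([model.wv[w] for w in vocabulary])
print(first_matrix.shape)   # (5, 8)
```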
In an embodiment, the determining at least two first texts based on a splitting result obtained by splitting the corpus information includes:
splitting the corpus information to obtain at least two second texts;
counting the occurrence frequency of each second text in the corpus information, and determining a statistical result;
and determining at least two first texts in the at least two second texts based on the determined statistical results.
The corpus information is split to obtain at least two second texts as the splitting result; a statistical result is determined for each second text according to the number of times it occurs in the splitting result of the corpus information; and at least two first texts are determined among the at least two second texts based on the statistical results. The first text represents text, among the second texts obtained by splitting the corpus information, that is highly relevant to semantic recognition and serves as model input in the corpus processing result. Determining the first texts from occurrence statistics of the corpus information therefore improves the accuracy of the recognition result during semantic recognition.
Here, the number of the first texts may be further adjusted by feedback according to the expression effect of the corpus processing result on the model and the size of the memory occupied by loading the model, so as to reduce the memory occupied by loading the corpus processing result on the model.
In an embodiment, the determining at least two first texts from the at least two second texts based on the determined statistics comprises:
determining the first text from at least two second texts, the number of times of occurrence of which in the corpus information meets a set threshold, based on the determined statistical result;
and/or,
determining the first text in the at least two second texts by using an IDF algorithm based on the determined statistical result.
The at least two first texts may be determined among the at least two second texts by taking as first texts the second texts whose number of occurrences in the corpus information satisfies a set threshold, by determining the first texts among the second texts using an IDF algorithm, or by a combination of both.
For example, suppose corpus information 1 is "A common cat: the cat's head is round and its face is short" and corpus information 2 is "Common cats and dogs are the most widely kept pets, with a lifespan of about 12-18 years". Splitting corpus information 1 yields the second texts "common", "cat", "head", "round", "face", and "short", whose occurrence counts are 1, 2, 1, 1, 1, and 1 respectively, so "cat" is determined as a first text. Combining the IDF algorithm, "head", "round", "face", and "short" do not occur in corpus information 2, so they are also determined as first texts. Using the counted frequencies together with the IDF algorithm, the first texts corresponding to corpus information 1 are therefore determined to be "cat", "head", "round", "face", and "short".
By counting the occurrences of the second texts in the splitting result obtained from the corpus information, high-frequency texts and high-IDF texts are determined as the first texts most relevant to semantic recognition: texts selected for high frequency usually reflect the topic of the corresponding corpus information, while first texts selected by the IDF algorithm usually relate to its details, so semantic recognition based on the determined first texts yields more accurate results.
Determining texts of high importance as the first texts by combining high frequency and/or high IDF, based on the counted occurrences in the corpus information, thus improves the accuracy of the recognition result when semantic recognition is performed on the determined first texts.
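A sketch of this combined frequency/IDF screening is shown below; the toy documents and both thresholds are assumptions, and a real embodiment would tune them by the feedback adjustment described elsewhere in this disclosure.

```python
import math
from collections import Counter

# Hypothetical split corpora: one list of second texts per corpus document.
documents = [
    ["common", "cat", "cat", "head", "round", "face", "short"],
    ["common", "cat", "dog", "pet"],
]

# Count total occurrences and the number of documents containing each text.
term_freq = Counter(t for doc in documents for t in doc)
doc_freq = Counter(t for doc in documents for t in set(doc))
n_docs = len(documents)
idf = {t: math.log(n_docs / doc_freq[t]) for t in doc_freq}

# Keep a second text as a first text if it is frequent or has high IDF.
freq_threshold, idf_threshold = 2, math.log(2)
first_texts = {t for t in term_freq
               if term_freq[t] >= freq_threshold or idf[t] >= idf_threshold}
print(sorted(first_texts))
# Frequency keeps 'cat' and 'common'; IDF keeps the terms unique to one document.
```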
In an embodiment, the determining at least two first texts based on a splitting result obtained by splitting the corpus information includes:
determining at least two first texts based on a splitting result obtained by splitting the corpus information according to a set word segmentation rule; the first text represents a word determined based on a splitting result of the corpus information.
In practical applications, corpus information is usually processed into word vectors by PCA, LDA, or Embedding. Here, the word segmentation rule may be set as space splitting, n-gram splitting using an n-gram algorithm, regular-expression splitting, or the like. In this embodiment, the corpus information is split according to the set word segmentation rule, at least two words are determined from the resulting splitting result, and the space occupied by the word vectors is then reduced by clustering; since no large high-dimensional matrix multiplications are needed, the computing power consumed by corpus processing is reduced and the corpus processing speed increased.
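A sketch of these three word segmentation rules is shown below; the regular expression used for the "regex" rule is an assumed pattern, not one specified by the disclosure.

```python
import re

def split_corpus(text: str, rule: str = "space", n: int = 2):
    """Split corpus information according to a set word segmentation rule:
    whitespace splitting, character n-grams, or a regular-expression rule."""
    if rule == "space":
        return text.split()
    if rule == "ngram":
        return [text[i:i + n] for i in range(len(text) - n + 1)]
    if rule == "regex":
        return re.findall(r"[A-Za-z]+|\d+", text)   # assumed example pattern
    raise ValueError(f"unknown rule: {rule}")

print(split_corpus("the cat's head is round", rule="space"))
print(split_corpus("corpus", rule="ngram", n=3))   # ['cor', 'orp', 'rpu', 'pus']
```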
In an embodiment, the clustering the first vectors based on the similarity between the first vectors includes:
obtaining a similarity measurement matrix between the first vectors according to the similarity between the elements of the first vectors;
and clustering each first vector according to the similarity metric matrix.
A similarity measurement matrix between the first vectors is determined according to the similarity between the elements of the first vectors, and the first vectors are clustered according to the determined similarity measurement matrix. Here, a similarity measure such as the Euclidean distance may be used to compute the similarity between the features of the first vectors, and the similarity measurement matrix is assembled from the resulting pairwise similarities.
For example, given three first vectors, the similarity between first vector I and first vector J is S1, the similarity between first vector I and first vector K is S2, and the similarity between first vector J and first vector K is S3; the similarity measurement matrix of the three first vectors is then (S1, S2, S3).
In this way, the similarity measurement matrix is determined by evaluating the pairwise similarity between the features of the first vectors, enabling the clustering of the first vectors.
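A sketch of building such a similarity measurement matrix from pairwise Euclidean distances is shown below; representing similarity as negative distance is one assumed convention (larger values mean more similar), and the vector values are illustrative.

```python
import numpy as np

def similarity_matrix(first_vectors: np.ndarray) -> np.ndarray:
    """Pairwise similarity between first vectors, here taken as the negative
    Euclidean distance (larger means more similar)."""
    diff = first_vectors[:, None, :] - first_vectors[None, :, :]
    return -np.linalg.norm(diff, axis=-1)

I = np.array([0.0, 0.0, 0.80, 0.0])
J = np.array([0.0, 0.0, 0.82, 0.0])
K = np.array([0.0, 0.0, 0.0, 0.30])
S = similarity_matrix(np.stack([I, J, K]))
# S[0, 1] plays the role of S1, S[0, 2] of S2, and S[1, 2] of S3 in the text above.
```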
The present application will be described in further detail with reference to the following application examples.
In the NLP field of machine learning, word vectors are a common representation of word meaning. In machine learning based on word vectors, the larger the dimension of the input word vectors, the more accurately different words can be distinguished; the word vector dimension is generally 300 to 600. However, the larger the word vector dimension, the more memory the model occupies when loading the word vectors, which can make the word list consume excessive memory while the electronic device loads the model. In the related art, to reduce the memory occupied by loading the model, word vector dimension reduction is usually performed during corpus processing using PCA, LDA, or deep-network-based Embedding; these approaches consume substantial computing power, are slow, and need to be combined with a neural network. Here, a word vector is a way of converting a word into a vector representation for machine-learning data input. The word vector dimension is the length of the vector a word is converted into; word vectors of words with similar senses are close in distance.
Therefore, the embodiments of the present application provide a word vector compression scheme that reduces the memory space occupied by word vectors using a clustering method, thereby reducing the overall memory footprint of the model and achieving compression of the word vector model. Here, model compression means reducing the size of the model loaded into memory by some method, thereby reducing the model's memory consumption. Clustering is a method of grouping vectors that are close in distance into a class based on the distances between the vectors.
With reference to fig. 3, the corresponding corpus processing method includes:
(1) Corpus information participle
The corpus information data is split according to a certain rule; commonly used word segmentation methods include space splitting, n-gram splitting, regular-expression splitting, and the like. After splitting, each sequence is represented in the form of multiple words.
(2) Counting word frequency and number of words
According to the words split from each sequence, the number of occurrences of each word in the training set and the total number of words across all samples are counted, to support the subsequent steps of word list selection and word count confirmation.
(3) Screening vocabulary
From the counted word frequency information, words with high frequency of occurrence are screened out to serve as the word list. Here, the importance of words may also be determined by combining a Term Frequency-Inverse Document Frequency (TF-IDF) algorithm or the like, selecting the more important words from the split words.
(4) Feature extraction
Word vector training is performed mainly with a machine learning model of the word2vec algorithm, such as a CBOW model or a Skip-Gram model, to obtain the extracted features. The word vector corresponding to each word in the screened word list is obtained from the text information of that word.
(5) Word vector compression
The acquired word vectors are clustered, for example with the k-means clustering method, thereby reducing the memory consumed by the word vectors, reducing the memory occupied by model loading, and achieving model compression.
In the word vector compression stage, as shown in Fig. 4, the processing consists of three main steps: first vector division, first vector clustering, and cluster center replacement. The first matrix is partitioned into blocks and clustered, the cluster center of each block is found, and the first vectors of the same cluster are represented by their cluster center. Here, the first matrix is an a × b two-dimensional matrix, each row of which is the vector representation of one word. A cluster is a set of data of the same class in the clustering result, whose elements share a certain similarity.
First vector division: the first matrix is split along its rows into a plurality of sub-vector blocks.
Clustering: the first vectors in each sub-vector block are clustered separately to find the corresponding cluster centers.
Cluster center replacement: within each sub-vector block, the first vectors are replaced by their cluster centers. The final number of cluster centers is K × L (with K clusters in each of the L sub-vector blocks), which is smaller than the original data volume a × b; this reduces word vector memory consumption, reduces the memory occupied by model loading, and achieves model compression.
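The memory saving can be estimated with simple arithmetic; the sizes below are assumed for illustration. The scheme resembles product quantization of an embedding table: only the K × L cluster centers plus a small per-word index table need to be stored.

```python
# A rough memory estimate for the compression described above (assumed sizes).
a, b = 50_000, 512          # a words, b-dimensional word vectors (float32)
L, p = 8, 512 // 8          # L sub-vector blocks of p columns each
K = 256                     # clusters per block; an index then fits in one byte

original = a * b * 4                     # a x b float32 matrix
centers  = K * L * p * 4                 # K x L cluster centers of p dims each
indices  = a * L * 1                     # one uint8 cluster index per block
compressed = centers + indices
print(original, compressed, original / compressed)   # roughly 110x smaller here
```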
(6) Feedback adjustment
The size of the word list, the dimension of the vectors used for replacement during word vector compression, and the number of clusters are adjusted according to the performance of the word vectors in the model and the memory occupied by loading the model.
In the embodiments of the present application, the high-dimensional word vectors are split into several vector blocks, the first vectors of each sub-vector block are clustered, and the corresponding cluster centers replace the first vectors. Compared with related-art corpus processing schemes for reducing the memory occupied by model loading, which must perform large numbers of high-dimensional matrix multiplications, the embodiments reduce the space occupied by the word vectors through clustering, so the corpus compression process needs no large high-dimensional matrix multiplications; this reduces the computing power consumed by corpus processing and increases the corpus processing speed. Meanwhile, the word list size, the replacement vector dimension used during word vector compression, and the number of clusters are feedback-adjusted according to the performance of the word vectors in the model and the memory occupied by loading the model.
In order to implement the method according to the embodiment of the present application, an embodiment of the present application further provides a corpus processing apparatus, as shown in fig. 5, the apparatus includes:
a generating unit 501, configured to generate a first matrix based on the corpus information; each row element of the first matrix represents a first text in the corpus information;
a dividing unit 502, configured to divide each row element of the first matrix into a first vector with a set dimension;
a clustering unit 503, configured to cluster the first vectors based on similarities between the first vectors to obtain at least one cluster;
a replacing unit 504, configured to replace the first vector in the corresponding cluster with a cluster center corresponding to each cluster obtained by clustering, so as to obtain a second matrix;
and the identifying unit 505 is configured to input the second matrix into a set NLP model, and obtain a semantic identification result about the corpus information.
In an embodiment, the clustering unit 503 is configured to:
distributing the first vectors with the same columns occupied in the first matrix to the same first set to obtain m first sets;
and clustering the first vectors in each of the m first sets according to the similarity among the first vectors.
In one embodiment, the dividing unit 502 is configured to:
determining at least two first texts based on a splitting result obtained by splitting the corpus information;
and performing feature extraction on each of the determined at least two first texts to generate the first matrix.
In an embodiment, the determining at least two first texts based on a splitting result obtained by splitting the corpus information includes:
splitting the corpus information to obtain at least two second texts;
counting the occurrence frequency of each second text in the corpus information, and determining a counting result;
and determining at least two first texts in the at least two second texts based on the determined statistical results.
In one embodiment, the determining at least two first texts among the at least two second texts based on the determined statistics comprises:
determining, as the first text based on the determined statistical result, the second texts whose number of occurrences in the corpus information meets a set threshold;
and/or,
determining the first text in the at least two second texts by using an IDF algorithm based on the determined statistical result.
In an embodiment, the determining at least two first texts based on a splitting result obtained by splitting the corpus information includes:
determining at least two first texts based on a splitting result obtained by splitting the corpus information according to a set word segmentation rule; the first text represents a word determined based on a splitting result of the corpus information.
In an embodiment, the clustering unit 503 is configured to:
obtaining a similarity measurement matrix between the first vectors according to the similarity between the elements of the first vectors;
and clustering the first vectors according to the similarity measurement matrix.
In practical applications, the generating unit 501, the dividing unit 502, the clustering unit 503, the replacing unit 504, and the identifying unit 505 may be implemented by a processor in the corpus processing apparatus, such as a Central Processing Unit (CPU), a Digital Signal Processor (DSP), a Microcontroller Unit (MCU), or a Field-Programmable Gate Array (FPGA).
It should be noted that: in the corpus processing apparatus provided in the above embodiment, only the division of the above program modules is used for illustration when performing corpus processing, and in practical applications, the processing may be distributed to different program modules according to needs, that is, the internal structure of the apparatus may be divided into different program modules to complete all or part of the above-described processing. In addition, the corpus processing apparatus and the corpus processing method provided in the above embodiments belong to the same concept, and specific implementation processes thereof are described in detail in the method embodiments, and are not described herein again.
Based on the hardware implementation of the program module, and in order to implement the corpus processing method according to the embodiment of the present application, an embodiment of the present application further provides an electronic device, as shown in fig. 6, where the electronic device 600 includes:
a communication interface 610 capable of information interaction with other devices such as network devices and the like;
and the processor 620 is connected with the communication interface 610 to realize information interaction with other devices, and is used for executing the method provided by one or more technical solutions when running a computer program. And the computer program is stored on the memory 630.
Specifically, the processor 620 is configured to:
generating a first matrix based on the corpus information; each row element of the first matrix represents a first text in the corpus information;
dividing each row element of the first matrix into a first vector with a set dimension;
clustering the first vectors based on the similarity between the first vectors to obtain at least one cluster;
replacing the first vector in the corresponding cluster by using the cluster center corresponding to each cluster obtained by clustering to obtain a second matrix;
and inputting the second matrix into a set Natural Language Processing (NLP) model to obtain a semantic recognition result about the corpus information.
In one embodiment, the processor 620 is configured to:
distributing the first vectors with the same columns occupied in the first matrix to the same first set to obtain at least one first set;
and clustering the first vectors in each first set of the at least one first set according to the similarity among the first vectors.
In one embodiment, the processor 620 is configured to:
determining at least two first texts based on a splitting result obtained by splitting the corpus information;
and performing feature extraction on each of the determined at least two first texts to generate the first matrix.
In one embodiment, the processor 620 is configured to:
splitting the corpus information to obtain at least two second texts;
counting the occurrence frequency of each second text in the corpus information, and determining a counting result; and determining at least two first texts in the at least two second texts based on the determined statistical results.
In one embodiment, the processor 620 is configured to:
determining at least two second texts, the number of times of occurrence of which in the corpus information meets a set threshold, as the first text based on the determined statistical result;
and/or,
determining the first text in the at least two second texts by using an IDF algorithm based on the determined statistical result.
In one embodiment, the processor 620 is configured to:
determining at least two first texts based on a splitting result obtained by splitting the corpus information according to a set word segmentation rule; the first text represents a word determined based on a splitting result of the corpus information.
In one embodiment, the processor 620 is configured to:
obtaining a similarity measurement matrix between the first vectors according to the similarity between the elements of the first vectors;
and clustering each first vector according to the similarity metric matrix.
Of course, in practice, the various components in the electronic device 600 are coupled together by the bus system 640. It is understood that bus system 640 is used to enable communications among these components. The bus system 640 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 640 in fig. 6.
The memory 630 in the present embodiment is used to store various types of data to support the operation of the electronic device 600. Examples of such data include: any computer program for operating on the electronic device 600.
It will be appreciated that the memory 630 can be either volatile memory or nonvolatile memory, and can include both volatile and nonvolatile memory. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferromagnetic Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be disk memory or tape memory. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memory 630 described in the embodiments herein is intended to comprise, without being limited to, these and any other suitable types of memory.
The method disclosed in the embodiments of the present application may be applied to the processor 620, or may be implemented by the processor 620. Processor 620 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 620. The processor 620 described above may be a general purpose processor, a DSP, or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like. Processor 620 may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed in the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software modules may be located in a storage medium located in the memory 630, and the processor 620 reads the program in the memory 630 and performs the steps of the aforementioned methods in conjunction with its hardware.
Optionally, when the processor 620 executes the program, the corresponding process implemented by the electronic device in the methods according to the embodiments of the present application is implemented, and for brevity, is not described again here.
In an exemplary embodiment, the present application further provides a storage medium, i.e., a computer storage medium, specifically a computer readable storage medium, for example, a memory 630 storing a computer program, which is executable by a processor 620 of an electronic device to perform the steps of the foregoing method. The computer readable storage medium may be Memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash Memory, magnetic surface Memory, optical disk, or CD-ROM.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus, electronic device and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
Alternatively, the integrated units described above in the present application may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially implemented or portions thereof contributing to the prior art may be embodied in the form of a software product stored in a storage medium, and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media capable of storing program code.
The technical means described in the embodiments of the present application may be combined arbitrarily without conflict. Unless otherwise specified and limited, the term "coupled" is to be construed broadly, e.g., as meaning an electrical connection, or communication between two elements, either directly or indirectly through intervening media; the specific meanings of such terms are understood by those skilled in the art according to the specific circumstances.
In addition, in the examples of the present application, "first", "second", and the like are used to distinguish similar objects and are not necessarily used to describe a specific or sequential order. It should be understood that objects distinguished by "first", "second", and "third" may be interchanged where appropriate, so that the embodiments of the application described herein can be implemented in an order other than that illustrated or described.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Various combinations of the specific features in the embodiments described in the detailed description may be made without contradiction, for example, different embodiments may be formed by different combinations of the specific features, and in order to avoid unnecessary repetition, various possible combinations of the specific features in the present application will not be described separately.

Claims (10)

1. A corpus processing method, comprising:
generating a first matrix based on the corpus information; each row element of the first matrix represents a first text in the corpus information;
dividing each row element of the first matrix into a first vector with a set dimension;
clustering each first vector based on the similarity between the first vectors to obtain at least one cluster;
replacing the first vector in the corresponding cluster by using the cluster center corresponding to each cluster obtained by clustering to obtain a second matrix;
and inputting the second matrix into a set Natural Language Processing (NLP) model to obtain a semantic recognition result about the corpus information.
2. The corpus processing method according to claim 1, wherein said clustering each first vector based on a similarity between each first vector comprises:
allocating first vectors with the same columns in the first matrix to the same first set to obtain at least one first set;
and clustering the first vectors in each first set of the at least one first set according to the similarity among the first vectors.
3. The corpus processing method according to claim 1, wherein said generating a first matrix based on corpus information comprises:
determining at least two first texts based on a splitting result obtained by splitting the corpus information;
and performing feature extraction on each of the determined at least two first texts to generate the first matrix.
4. The corpus processing method according to claim 3, wherein said determining at least two first texts based on a splitting result obtained by splitting the corpus information comprises:
splitting the corpus information to obtain at least two second texts;
counting the occurrence frequency of each second text in the corpus information, and determining a statistical result;
and determining at least two first texts in the at least two second texts based on the determined statistical results.
5. The corpus processing method according to claim 4, wherein said determining at least two first texts among said at least two second texts based on the determined statistical result comprises:
determining, as the first text based on the determined statistical result, the second texts whose number of occurrences in the corpus information meets a set threshold;
and/or,
and determining the first text in the at least two second texts by utilizing an inverse document frequency IDF algorithm based on the determined statistical result.
6. The corpus processing method according to any one of claims 3 to 5, wherein determining at least two first texts based on a splitting result obtained by splitting the corpus information comprises:
determining at least two first texts based on a splitting result obtained by splitting the corpus information according to a set word segmentation rule; the first text represents a word determined based on a splitting result of the corpus information.
7. The method according to any one of claims 1 to 5, wherein clustering the first vectors based on similarities between the first vectors comprises:
obtaining a similarity measurement matrix between the first vectors according to the similarity between the elements of the first vectors;
and clustering each first vector according to the similarity metric matrix.
8. A corpus processing apparatus, comprising:
the generating unit is used for generating a first matrix based on the corpus information; each row element of the first matrix represents a first text in the corpus information;
the dividing unit is used for dividing each row element of the first matrix into a first vector with a set dimension;
the clustering unit is used for clustering each first vector based on the similarity between the first vectors to obtain at least one cluster;
the replacing unit is used for replacing the first vector in the corresponding cluster by using the cluster center corresponding to each cluster obtained by clustering to obtain a second matrix;
and the recognition unit is used for inputting the second matrix into a set NLP model to obtain a semantic recognition result related to the corpus information.
9. An electronic device, comprising: a processor and a memory for storing a computer program capable of running on the processor,
wherein the processor is configured to execute the steps of the corpus processing method according to any one of claims 1 to 7 when running the computer program.
10. A storage medium having stored thereon a computer program for implementing the steps of the corpus processing method according to any one of the claims 1 to 7 when being executed by a processor.
CN202110835844.8A 2021-07-23 2021-07-23 Corpus processing method and device, electronic equipment and storage medium Pending CN115687606A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110835844.8A CN115687606A (en) 2021-07-23 2021-07-23 Corpus processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110835844.8A CN115687606A (en) 2021-07-23 2021-07-23 Corpus processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115687606A 2023-02-03

Family

ID=85044439

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110835844.8A Pending CN115687606A (en) 2021-07-23 2021-07-23 Corpus processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115687606A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116882408A (en) * 2023-09-07 2023-10-13 南方电网数字电网研究院有限公司 Construction method and device of transformer graph model, computer equipment and storage medium
CN116882408B (en) * 2023-09-07 2024-02-27 南方电网数字电网研究院有限公司 Construction method and device of transformer graph model, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110021439B (en) Medical data classification method and device based on machine learning and computer equipment
US11163947B2 (en) Methods and systems for multi-label classification of text data
US10860654B2 (en) System and method for generating an answer based on clustering and sentence similarity
CN111930942B (en) Text classification method, language model training method, device and equipment
US20150095017A1 (en) System and method for learning word embeddings using neural language models
CN111291177A (en) Information processing method and device and computer storage medium
CN111985228A (en) Text keyword extraction method and device, computer equipment and storage medium
CN109829154B (en) Personality prediction method based on semantics, user equipment, storage medium and device
WO2014073206A1 (en) Information-processing device and information-processing method
CN113032253B (en) Test data feature extraction method, test method and related device
CN109062958B (en) Primary school composition automatic classification method based on TextRank and convolutional neural network
CN111241271B (en) Text emotion classification method and device and electronic equipment
CN115687606A (en) Corpus processing method and device, electronic equipment and storage medium
Zhang et al. Multi-document extractive summarization using window-based sentence representation
CN111553156A (en) Keyword extraction method, device and equipment
US20170293863A1 (en) Data analysis system, and control method, program, and recording medium therefor
US11042520B2 (en) Computer system
US20210073258A1 (en) Information processing apparatus and non-transitory computer readable medium
US10896296B2 (en) Non-transitory computer readable recording medium, specifying method, and information processing apparatus
CN107622129B (en) Method and device for organizing knowledge base and computer storage medium
CN115688771B (en) Document content comparison performance improving method and system
KR102642012B1 (en) Electronic apparatus for performing pre-processing regarding analysis of text constituting electronic medical record
KR102405938B1 (en) Sense vocabulary clustering method for word sense disambiguation and recording medium thereof
CN114091456B (en) Intelligent positioning method and system for quotation contents
Mayaluru One Class Text Classification using an Ensemble of Classifiers

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination