CN116881463B

CN116881463B - Artistic multi-mode corpus construction system based on data

Info

Publication number: CN116881463B
Application number: CN202311132179.1A
Authority: CN
Inventors: 刘璟之; 白颢
Original assignee: Nanjing University Of Arts
Current assignee: Nanjing University Of Arts
Priority date: 2023-09-05
Filing date: 2023-09-05
Publication date: 2024-01-26
Anticipated expiration: 2043-09-05
Also published as: CN116881463A

Abstract

The invention relates to the technical field of data processing and provides a data-based artistic multimodal corpus construction system. The system collects a plurality of pieces of artistic multimodal corpus data and obtains the word segments of each piece and the part of speech of each word segment; obtains the representative degree of each initial corpus data, selects initial clustering centers accordingly, and clusters the initial corpus data into initial clusters; obtains the difference degree between each newly added corpus data and each initial cluster, the representative feature value of each word segment in the newly added corpus data, the association between the newly added corpus data and each initial cluster, and the information loss degree of each word segment; obtains a corrected feature value of each word segment in the newly added corpus data, obtains a conversion vector of the newly added corpus data from the corrected feature values, and clusters the newly added corpus data; and completes construction of the artistic corpus according to the clustering result. The invention aims to solve the problem that the innovativeness of artistic multimodal data distorts the quantification of similarity between multimodal data, so that the data cannot be clustered effectively and the corpus cannot be constructed.

Description

Artistic multi-mode corpus construction system based on data

Technical Field

The invention relates to the technical field of data processing, in particular to an artistic multimodal corpus construction system based on data.

Background

Research on artistic corpora provides an important basis and support for semantic understanding, interdisciplinary research, innovation in artistic techniques, digital artworks and other aspects of the art field; by constructing an artistic corpus, researchers can understand and explore the essence and meaning of art more comprehensively and deeply. Traditional corpora, however, exist mainly in text form. With the spread of the internet and social media, large amounts of artistic multimodal data such as images, video and audio have appeared online. These multimodal data carry rich semantic and emotional information, can provide a more comprehensive and fine-grained data basis for building more powerful language models and dialogue systems, and, when data of different modalities are combined, allow more comprehensive and accurate semantic understanding and information expression.

Constructing an artistic multimodal corpus is a complex task, an important part of which is processing and organizing the multimodal data. Effective clustering of multimodal data helps researchers find patterns and associations in the data and gain deeper insight. A widely used clustering method is the Single-Pass algorithm, which is applied to cluster analysis of multimodal data because it is simple and efficient to implement and does not require the number of clusters to be specified in advance. However, the method clusters only according to the similarity between multimodal data, and a change in the order in which data are clustered seriously affects the subsequent clustering. Because of the innovativeness of artistic multimodal data, outliers readily appear in the clustering space, so the clustering result classifies the multimodal data inaccurately, which affects the construction of the artistic multimodal corpus.

Disclosure of Invention

The invention provides a data-based artistic multimodal corpus construction system to solve the problem that the innovativeness of existing artistic multimodal data distorts the quantification of similarity between multimodal data, so that clustering cannot be carried out effectively and the corpus cannot be constructed. The following technical scheme is adopted:

One embodiment of the present invention provides a data-based artistic multimodal corpus construction system, comprising:

a corpus data acquisition module, used for acquiring corpus data to obtain a plurality of initial corpus data and newly added corpus data, and for acquiring a plurality of word segments of each corpus data together with the part of speech and word vector of each word segment;

a corpus data clustering module, used for obtaining a plurality of initial clustering centers according to the representative degree of each initial corpus data, and performing Single-Pass clustering according to the initial clustering centers and the feature vector of each initial corpus data to obtain a plurality of initial clusters;

obtaining the difference degree between each newly added corpus data and each initial cluster according to the word segments and parts of speech of the newly added corpus data and of the initial clustering center, and obtaining the representative feature value of each word segment in each newly added corpus data for each initial cluster according to the difference degree and the representative coefficient of each word segment; obtaining, according to the representative feature values and the representative degree of the initial clustering center, the association between each newly added corpus data and each initial cluster and the information loss degree of each word segment in each newly added corpus data;

obtaining, according to the information loss degree and the representative feature value, a corrected feature value of each word segment in each newly added corpus data for each initial cluster, obtaining a conversion vector of each newly added corpus data for each initial cluster according to the corrected feature values, and performing Single-Pass clustering on the newly added corpus data according to the conversion vectors;

and an artistic corpus construction module, used for completing construction of the artistic corpus according to the clustering result.

Further, the method for obtaining a plurality of initial clustering centers according to the representing degree of each initial corpus data comprises the following specific steps:

obtaining the topic characterization degree of each initial corpus data according to the word segments and the part-of-speech distribution in the initial corpus data; the representative degree $R_i$ of the $i$-th initial corpus data is calculated from the following quantities:

$T_i$ denotes the topic characterization degree of the $i$-th initial corpus data; $T_{\max}$ denotes the maximum topic characterization degree among all initial corpus data; $N_i$ denotes the number of combinations formed by pairing the $i$-th initial corpus data with each other initial corpus data; $M_{i,j}$ denotes the number of part-of-speech combinations in the $j$-th combination of the $i$-th initial corpus data, a part-of-speech combination being formed by taking one part of speech from each of the two initial corpus data; $\mathbf{u}_{i,j,k}$ denotes, for the $k$-th part-of-speech combination in the $j$-th combination of the $i$-th initial corpus data, the word vector of the most frequent word segment among all word segments under the part of speech that corresponds to the $i$-th initial corpus data; $\mathbf{v}_{i,j,k}$ denotes, for the same part-of-speech combination, the word vector of the most frequent word segment among all word segments under the part of speech that corresponds to the other initial corpus data; and $\cos(\cdot,\cdot)$ denotes the cosine similarity of two vectors;

the representative degree of each initial corpus data is obtained in this way, and the initial corpus data whose representative degree is greater than a representative threshold are taken as initial clustering centers, yielding a plurality of initial clustering centers.

Further, the topic representation degree of each initial corpus data is obtained by the specific method comprising the following steps:

for the $i$-th initial corpus data, obtaining the number of word segments whose part of speech is noun and denoting these word segments noun segments; performing DBSCAN clustering on all noun segments in the initial corpus data, with the DTW distance between the word vectors of the noun segments as the clustering distance, to obtain a plurality of clusters, denoted noun categories; for any other part of speech, obtaining the total number of word segments of that part of speech in the initial corpus data, obtaining the number of those word segments that occur before the noun segments in each sentence, denoted the leading number of the part of speech, and denoting the ratio of the leading number to the total number as the leading probability of the part of speech; obtaining the leading probability of every other part of speech in the initial corpus data; the topic characterization degree $T_i$ of the $i$-th initial corpus data is then calculated from the following quantities:

$n_i$ denotes the number of noun segments in the $i$-th initial corpus data; $c_i$ denotes the number of noun categories in the $i$-th initial corpus data; $P_i$ denotes the number of parts of speech other than noun in the $i$-th initial corpus data; $p_{i,a}$ denotes the leading probability of the $a$-th other part of speech in the $i$-th initial corpus data; $q_{i,a}$ denotes the number of word segments of the $a$-th other part of speech in the $i$-th initial corpus data; and $Q$ denotes the number of all word segments in all initial corpus data;

the topic characterization degree of each initial corpus data is obtained in this way.

Further, the method for obtaining the plurality of initial clusters comprises the following specific steps:

for any initial corpus data, arranging all word segmentation of the initial corpus data according to the appearance sequence in the initial corpus data, connecting word vectors of all word segmentation end to obtain corpus vectors of the initial corpus data, performing PCA dimension reduction on the corpus vectors, and marking the dimension-reduced vectors as feature vectors of the initial corpus data; and obtaining the feature vector of each initial corpus data, and carrying out Single-Pass clustering according to the initial clustering center and the feature vector of each initial corpus data to obtain a plurality of clusters, and marking the clusters as a plurality of initial clusters.

Further, the difference degree between each new corpus data and each initial cluster is obtained by the specific method that:

obtaining the topic characterization degree of each newly added corpus data; the difference degree $D_{x,y}$ between the $x$-th newly added corpus data and the $y$-th initial cluster is calculated from the following quantities:

$T_x$ denotes the topic characterization degree of the $x$-th newly added corpus data; $T_{\max}$ denotes the maximum topic characterization degree among all corpus data; $M_{x,y}$ denotes the number of part-of-speech combinations between the $x$-th newly added corpus data and the initial clustering center of the $y$-th initial cluster; $\mathbf{u}_{x,y,k}$ denotes, for the $k$-th part-of-speech combination between the $x$-th newly added corpus data and the initial clustering center of the $y$-th initial cluster, the word vector of the most frequent word segment among all word segments under the part of speech that corresponds to the $x$-th newly added corpus data; and $\mathbf{v}_{x,y,k}$ denotes, for the same part-of-speech combination, the word vector of the most frequent word segment among all word segments under the part of speech that corresponds to the initial clustering center of the $y$-th initial cluster;

the difference degree between each newly added corpus data and each initial cluster is obtained in this way.

Further, the specific method for obtaining the representative feature value of each word segment for each initial cluster in each new corpus data includes the following steps:

the representative coefficient $\beta_{x,z}$ of the $z$-th word segment in the $x$-th newly added corpus data is calculated from the following quantities: $S_x$ denotes the number of all word segments in the $x$-th newly added corpus data; $f_{x,z}$ denotes the number of occurrences of the $z$-th word segment in the $x$-th newly added corpus data; $\bar{\rho}_{x,z}$ denotes the mean distribution probability of the $z$-th word segment within its sentences, the distribution probability being, for each occurrence, the ratio of the position of the word segment in the sentence to the length of that sentence; and $\exp(\cdot)$ denotes the exponential function with the natural constant as its base;

obtaining the representative coefficient of each word segment in the $x$-th newly added corpus data, performing linear normalization on the representative coefficients of all word segments, and denoting the result as the representative weight of each word segment; denoting the product of the representative weight of each word segment and the difference degree $D_{x,y}$ as the representative feature value of that word segment in the $x$-th newly added corpus data for the $y$-th initial cluster; the representative feature value of each word segment in each newly added corpus data for each initial cluster is obtained in this way.
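The normalization and weighting step just described can be sketched in a few lines. This is a minimal Python sketch, assuming that "linear normalization" means min-max scaling and that the representative coefficients and the difference degree have already been computed. The function name `representative_feature_values` and the sample numbers are illustrative and do not come from the patent.

```python
def representative_feature_values(coefficients, difference_degree):
    """Turn representative coefficients into representative feature values.

    coefficients: one representative coefficient per word segment.
    difference_degree: difference degree between the corpus data and one cluster.
    """
    lo, hi = min(coefficients), max(coefficients)
    span = hi - lo
    # Min-max scaling (assumed meaning of "linear normalization").
    # If all coefficients are equal, every weight is set to 0.0 (assumption).
    weights = [(c - lo) / span if span else 0.0 for c in coefficients]
    # Representative feature value = representative weight * difference degree.
    return [w * difference_degree for w in weights]


values = representative_feature_values([2.0, 4.0, 6.0], difference_degree=0.5)
print(values)  # [0.0, 0.25, 0.5]
```

Because the weights lie in [0, 1], each feature value is bounded by the difference degree, which keeps values for different clusters on a comparable scale.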

Further, the method for obtaining the relevance of each new corpus data and each initial cluster and the information loss degree of each word segmentation in each new corpus data comprises the following specific steps:

denoting the word segments of each newly added corpus data whose representative feature value for an initial cluster is greater than a feature threshold as the representative word segments of that newly added corpus data for that initial cluster; the association $A_{x,y}$ between the $x$-th newly added corpus data and the $y$-th initial cluster is calculated from the following quantities:

$G_{x,y}$ denotes the number of representative word segments of the $x$-th newly added corpus data for the $y$-th initial cluster; $R_y$ denotes the representative degree of the initial clustering center of the $y$-th initial cluster; $F_{x,y,g}$ denotes the representative feature value, for the $y$-th initial cluster, of the $g$-th representative word segment of the $x$-th newly added corpus data for that cluster; $|\cdot|$ denotes the absolute value; and $\exp(\cdot)$ denotes the exponential function with the natural constant as its base;

obtaining the association between each newly added corpus data and each initial cluster; and obtaining the information loss degree of each word segment in each newly added corpus data according to the association, the representative degree of the initial clustering center, and the word segments in the newly added corpus data.

Further, the information loss degree of each word in each new corpus data is obtained by the specific method that:

for the $x$-th newly added corpus data, denoting each initial cluster whose association with the newly added corpus data is greater than an association threshold as an associated cluster of the newly added corpus data; the information loss degree $L_{x,z}$ of the $z$-th word segment in the $x$-th newly added corpus data is calculated from the following quantities:

$K$ denotes the number of initial clusters; $K_x$ denotes the number of associated clusters of the $x$-th newly added corpus data; $A_{x,e}$ denotes the association between the $x$-th newly added corpus data and its $e$-th associated cluster; and $A_{x,e}^{(-z)}$ denotes the decorrelation, that is, the association between the $x$-th newly added corpus data and its $e$-th associated cluster after the $z$-th word segment has been removed;

the decorrelation is calculated as follows: removing all occurrences of the $z$-th word segment from the $x$-th newly added corpus data; recalculating the difference degree between the remaining corpus data and the $e$-th associated cluster; obtaining the representative feature value of each remaining word segment; and obtaining the association after removal, which is denoted the decorrelation between the $x$-th newly added corpus data with its $z$-th word segment removed and the $e$-th associated cluster;

the information loss degree of each word segment in each newly added corpus data is obtained in this way.
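The decorrelation procedure is a leave-one-word-out loop: drop every occurrence of one word segment, recompute the association, and record the result. The sketch below shows only that loop in Python. The association computation itself is passed in as a callable, and `toy_relevance` is a deliberately simple placeholder (share of cluster keywords present), not the patent's association formula; all names are hypothetical.

```python
def decorrelations(tokens, cluster, relevance):
    """Association with `cluster` after removing each distinct word in turn.

    relevance(tokens, cluster) is a stand-in for the association computation.
    """
    result = {}
    for word in dict.fromkeys(tokens):  # distinct words, in first-seen order
        reduced = [t for t in tokens if t != word]  # drop every occurrence
        result[word] = relevance(reduced, cluster)
    return result


def toy_relevance(tokens, cluster_keywords):
    # Placeholder only: fraction of the cluster's keywords found in the tokens.
    return len(set(tokens) & cluster_keywords) / len(cluster_keywords)


tokens = ["ink", "wash", "ink", "mountain"]
cluster_keywords = {"ink", "mountain", "brush"}
full = toy_relevance(tokens, cluster_keywords)
after_removal = decorrelations(tokens, cluster_keywords, toy_relevance)
for word, value in after_removal.items():
    print(word, round(full - value, 3))  # drop in association per removed word
```

In this toy setting, removing "wash" leaves the association unchanged while removing "ink" or "mountain" lowers it, which is the kind of contrast the information loss degree is meant to capture.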

Further, the correction characteristic value of each word segment for each initial cluster in each new corpus data is obtained by the specific method that:

classifying multiple occurrences of the same word segment into one word segment type, the information loss degree and the representative feature value being the same for all word segments of the same type; arranging all word segment types in the $x$-th newly added corpus data in ascending order of information loss degree, and denoting the resulting sequence as the word segment loss sequence of the $x$-th newly added corpus data; forming a set from all word segments contained in the first several word segment types of the word segment loss sequence, as many types as the loss number, and denoting it the loss word segment set of the $x$-th newly added corpus data;

forming a set from all representative word segments of the $x$-th newly added corpus data for any one initial cluster, and denoting it the representative word segment set of the $x$-th newly added corpus data for that initial cluster; obtaining the representative word segment sets of the $x$-th newly added corpus data for all initial clusters; obtaining the intersection common to all representative word segment sets and the loss word segment set, and denoting all word segments in the intersection as the word segments to be corrected of the $x$-th newly added corpus data;
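The selection of word segments to be corrected is a sort followed by a set intersection. The following Python sketch takes the text literally: the loss set holds the word types with the smallest information loss, and the result is the intersection of that set with every representative word segment set. The function name, the `loss_count` parameter and the sample values are illustrative assumptions; the loss number itself is not specified in this passage.

```python
def words_to_correct(loss_by_word, representative_sets, loss_count):
    """Pick the word types that are low-loss and representative for every cluster.

    loss_by_word: information loss degree per word type.
    representative_sets: one set of representative words per initial cluster.
    loss_count: how many lowest-loss word types go into the loss set (assumed preset).
    """
    ordered = sorted(loss_by_word, key=loss_by_word.get)  # ascending loss
    loss_set = set(ordered[:loss_count])
    # Intersection common to the loss set and all representative sets.
    return set.intersection(loss_set, *(set(s) for s in representative_sets))


loss_by_word = {"ink": 0.10, "wash": 0.40, "mountain": 0.20, "the": 0.05}
representative_sets = [{"ink", "mountain"}, {"ink", "wash"}]
to_correct = words_to_correct(loss_by_word, representative_sets, loss_count=3)
print(to_correct)  # {'ink'}
```

If the intended reading is instead the loss set intersected with the union of the representative sets, only the last line of the function would change.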

the corrected feature value $\hat{F}_{x,y,w}$ of the $w$-th word segment to be corrected in the $x$-th newly added corpus data for the $y$-th initial cluster is calculated from the following quantities:

$W_x$ denotes the number of word segments to be corrected of the $x$-th newly added corpus data; $h_{x,w}$ denotes the number of occurrences of the $w$-th word segment to be corrected among all word segments to be corrected of the $x$-th newly added corpus data; and $F_{x,y,w}$ denotes the representative feature value of the $w$-th word segment to be corrected in the $x$-th newly added corpus data for the $y$-th initial cluster;

obtaining the corrected feature value of each word segment to be corrected in each newly added corpus data for each initial cluster; and, for the word segments in the newly added corpus data other than the word segments to be corrected, taking the representative feature value of each word segment as its corrected feature value.

Further, the method for obtaining the conversion vector of each new corpus data for each initial cluster according to the correction eigenvalue includes the following specific steps:

for the $x$-th newly added corpus data and the $y$-th initial cluster, arranging the corrected feature values of all word segments in the newly added corpus data for the $y$-th initial cluster in word segment order, and denoting the resulting sequence as the feature value sequence of the newly added corpus data for the initial cluster; obtaining the feature value sequence of each newly added corpus data for each initial cluster;

inputting the newly added corpus data, together with the corresponding feature value sequence, into a trained neural network, which outputs the conversion vector of the newly added corpus data for the initial cluster.

The beneficial effects of the invention are as follows. The invention performs adaptive Single-Pass clustering by adaptively obtaining the feature vectors and conversion vectors of the corpus data to be clustered, and uses the resulting clustering to construct an artistic multimodal corpus. To keep the order of the initial corpus data from distorting the clustering, an initial batch of corpus data is clustered first, with the initial corpus data of high representative degree serving as initial clustering centers. The conversion vector of each newly added corpus data is then obtained through a neural network from quantified feature values of that data. The feature values behind the conversion vector are intended to preserve the original representative degree (difference degree) as far as possible while clearly separating the associations with different clusters; they are obtained by weighing, according to the association, the loss incurred at a suitable level of information representation capability, so that the vector values are determined adaptively. This adaptive vectorization avoids the defects of conventional Single-Pass clustering, which relies only on the similarity between multimodal data and is strongly affected by changes in the order in which data are clustered, so the clustering result, and hence the artistic corpus built from it, is more accurate.

Drawings

To illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the invention; a person skilled in the art can obtain other drawings from them without inventive effort.

Fig. 1 is a block diagram of an artistic multimodal corpus construction system based on data according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of protection of the invention.

Referring to fig. 1, a block diagram of a data-based artistic multimodal corpus construction system according to an embodiment of the present invention is shown, where the system includes:

The corpus data acquisition module 101 collects a plurality of pieces of artistic multimodal corpus data and obtains the word segments of each piece of corpus data and the part of speech of each word segment.

The aim of this embodiment is to construct an artistic corpus from artistic multimodal corpus data, so the data must be collected first. In this embodiment the number of collected corpus data is set to 300. The artistic multimodal corpus data comprise journal papers, books, teaching materials, audio data, video data and network resources, with 50 corpus data obtained for each modality; of these, the 20 corpus data of each modality in the initial batch serve as the initial corpus data, and those in the other batches serve as the newly added corpus data. Each corpus data is preprocessed by text segmentation, transcription and word segmentation: text corpus can be segmented by sentence, paragraph or specified symbol, and audio and video corpus is cut into smaller segments at specific marks (such as time stamps or audio segment boundaries). Audio and video corpus is transcribed with automatic speech recognition (ASR), and jieba word segmentation is then applied, yielding the word segments of each corpus data together with the part of speech that jieba assigns to each word segment. Word segmentation, text segmentation and automatic speech recognition are known techniques and are not described further in this embodiment.

Thus, a plurality of corpus data, a plurality of word fragments of each corpus data and the part of speech of each word fragment are obtained.

Corpus data clustering module 102:

It should be noted that in Single-Pass clustering (incremental clustering), clusters are formed only according to the similarity between corpus data, and a change in the order in which data are clustered seriously affects the subsequent clustering; moreover, because of the innovativeness of artistic multimodal data, outliers readily appear in the clustering space. Therefore, when constructing an artistic multimodal corpus, the newly added corpus data should be clustered by adaptive Single-Pass clustering in which the corpus vectors to be clustered are obtained adaptively. The representative degree is calculated on the initial batch of corpus data, initial clustering centers are selected according to it, and the initial corpus data are clustered; feature values of the newly added corpus data are then obtained with reference to the resulting initial clusters, giving corpus vectors better suited to the clustering.

(1) Initial corpus data are obtained, the representative degree of each initial corpus data is obtained according to word segmentation and part of speech of the initial corpus data, an initial clustering center is obtained, and Single-Pass clustering is carried out on the initial corpus data to obtain a plurality of initial clusters.

It should be noted that, because Single-Pass clustering is affected by the order of the initial data, the initial batch of corpus data is clustered first, with the corpus data of high representative degree used as initial clustering centers, so that the initial clustering does not distort the clustering of subsequent corpus data. The representative degree quantifies the ability of an initial corpus data to represent information: the more information it represents, the greater its representative degree. It is characterized by the ability of the initial corpus data to characterize the topic and by its difference from the other initial corpus data. The topic characterization ability is corrected by that difference, so that the larger the difference between an initial corpus data and the other initial corpus data, the more representative its topic characterization and the greater its representative degree.

Specifically, each word segment in each initial corpus data and each newly added corpus data is converted into a word vector by a Word2vec model; Word2vec is a known technique and is not described further in this embodiment, in which the number of elements of a word vector is set to 5. Taking the $i$-th initial corpus data as an example, the number of word segments whose part of speech is noun is obtained, and these word segments are denoted noun segments; DBSCAN clustering is performed on all noun segments in the initial corpus data, with the DTW distance between the word vectors of the noun segments as the clustering distance, and the resulting clusters are denoted noun categories. For any other part of speech, the total number of word segments of that part of speech in the initial corpus data is obtained; since the corpus data consists of several sentences, the number of those word segments that occur before the noun segments in each sentence is obtained and denoted the leading number of the part of speech, and the ratio of the leading number to the total number is denoted the leading probability of the part of speech. After the leading probability of every other part of speech in the initial corpus data is obtained, the topic characterization degree $T_i$ of the $i$-th initial corpus data is calculated from the following quantities:

$n_i$ denotes the number of noun segments in the $i$-th initial corpus data; $c_i$ denotes the number of noun categories in the $i$-th initial corpus data; $P_i$ denotes the number of parts of speech other than noun in the $i$-th initial corpus data; $p_{i,a}$ denotes the leading probability of the $a$-th other part of speech in the $i$-th initial corpus data; $q_{i,a}$ denotes the number of word segments of the $a$-th other part of speech in the $i$-th initial corpus data; and $Q$ denotes the number of all word segments in all initial corpus data. Because topic information is mostly carried by nouns, the information distribution is represented by the ratio of the number of noun categories to the number of noun segments: the larger the ratio, the more noun categories there are, the stronger the ability to characterize topic information, and the greater the topic characterization degree. The noun information distribution is further corrected by the distribution of the other parts of speech: the larger the leading probability of another part of speech, the more of its word segments are distributed before noun segments and the greater its ability to characterize topic information; likewise, the more word segments a part of speech has, the larger its share of all word segments, the more important it is, and the greater the topic characterization degree. The topic characterization degree of each initial corpus data is obtained in this way.
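The leading probability described above is a simple counting statistic, and a short example may make it concrete. This is a minimal Python sketch under two assumptions that the text does not settle: a word segment counts as "leading" when it occurs before the first noun segment of its sentence, and nouns carry the tag `"n"`. The function name and the toy sentences are illustrative.

```python
from collections import Counter


def leading_probabilities(sentences, noun_tag="n"):
    """sentences: list of sentences, each a list of (word, part_of_speech) pairs."""
    total = Counter()    # word segments per non-noun part of speech
    leading = Counter()  # those found before the first noun of their sentence
    for sentence in sentences:
        first_noun = next(
            (idx for idx, (_, pos) in enumerate(sentence) if pos == noun_tag),
            None,
        )
        for idx, (_, pos) in enumerate(sentence):
            if pos == noun_tag:
                continue
            total[pos] += 1
            if first_noun is not None and idx < first_noun:
                leading[pos] += 1
    # Leading probability = leading number / total number, per part of speech.
    return {pos: leading[pos] / total[pos] for pos in total}


doc = [
    [("beautiful", "a"), ("painting", "n"), ("shows", "v"), ("light", "n")],
    [("artists", "n"), ("paint", "v"), ("bold", "a"), ("colors", "n")],
]
probs = leading_probabilities(doc)
print(probs)  # {'a': 0.5, 'v': 0.0}
```

Here one of the two adjectives precedes the first noun of its sentence and neither verb does, so adjectives receive a leading probability of 0.5 and verbs 0.0.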

Further, in the first stepFor example, the initial corpus data represents the degree +.>The calculation method of (1) is as follows:

wherein,indicate->Subject characterization degree of the initial corpus data, +.>Maximum value of the topic representation degree representing all initial corpus data,/- >Indicate->The number of combinations obtained by combining each initial corpus data with each other initial corpus data is the difference obtained by subtracting 1 from the number of initial corpus data, in this embodiment, the number of initial corpus data is 120->；/>Indicate->First ∈of the initial corpus data>The number of part-of-speech combinations in each combination, wherein the part-of-speech combinations represent that each part-of-speech obtained in two initial corpus data forms a combination, and the number of the part-of-speech combinations is the product of the number of the parts-of-speech contained in the two initial corpus data; />Indicate->First ∈of the initial corpus data>The>In the part of speech combination->Word vectors of the word segment having the largest number of occurrences among all word segments in the part of speech corresponding to the initial corpus data, e.g., the +.>The part of speech corresponding to the initial corpus data is noun +.>The word segmentation with the largest occurrence frequency in all noun word segmentation in the initial corpus data is 'art', and then the word vector of the 'art' is represented; />Indicate->First ∈of the initial corpus data>The>In the part-of-speech combinations, except->Word vectors of word segments with the largest occurrence frequency in all word segments under part of speech corresponding to the other initial corpus data except the initial corpus data >Representing cosine similarity of the two vectors;

The representing degree is quantified through the topic characterization degree: the larger the topic characterization degree, the larger the representing degree. It is also quantified through the differences under the part-of-speech combinations: the smaller the similarity between the word vectors of the two word segments under a part-of-speech combination, the larger the difference between the word segments, the larger the difference in the information represented by the initial corpus data under that part-of-speech combination, and the larger the corresponding representing degree. Because cosine similarity takes values in [-1, 1], 1 minus the cosine similarity takes values in [0, 2]; the averaged result is therefore multiplied by 1/2 to achieve normalization. The representing degree of each initial corpus data is obtained according to this method. A representing threshold is preset (0.78 in this embodiment), and the initial corpus data whose representing degree is larger than the representing threshold are taken as initial cluster centers, so that a plurality of initial cluster centers are obtained.
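The calculation described above can be sketched as follows. This is a minimal illustration, not the patented formula itself: the formula image is not available in this text, so the exact way the terms are combined (a product of the normalized topic characterization degree and the halved average of 1 minus cosine similarity) is an assumption drawn from the prose, and the names `representing_degree` and `select_centers` are hypothetical.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def representing_degree(topic_degree, max_topic_degree, pair_vectors):
    """pair_vectors holds one list per combination with another corpus item;
    each list contains (own_vector, other_vector) tuples, one per
    part-of-speech combination. The combination rule is an assumption."""
    per_combination = [
        np.mean([1.0 - cosine(u, v) for u, v in pos_pairs])
        for pos_pairs in pair_vectors
    ]
    # 1 - cos lies in [0, 2]; the factor 1/2 maps the average into [0, 1].
    return (topic_degree / max_topic_degree) * 0.5 * float(np.mean(per_combination))

def select_centers(degrees, threshold=0.78):
    """Indices of corpus items whose representing degree exceeds the threshold."""
    return [i for i, d in enumerate(degrees) if d > threshold]
```

With 120 initial corpus data, `pair_vectors` for one item would contain 119 lists, one for each other item.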

Further, taking any one initial corpus data as an example, all word segments of the initial corpus data are arranged in their order of appearance, and the word vectors of all word segments are concatenated end to end to obtain the corpus vector of the initial corpus data. PCA dimension reduction is performed on the corpus vector, with the target dimension set to 100 in this embodiment, and the dimension-reduced vector is recorded as the feature vector of the initial corpus data. It should be noted that a word segment that occurs multiple times participates in the calculation once for each occurrence, and that, unless otherwise specified, the same holds in the calculations of the subsequent steps. The feature vector of each initial corpus data is obtained according to this method, and Single-Pass clustering is performed according to the initial cluster centers and the feature vectors of the initial corpus data to obtain a plurality of clusters, which are recorded as the initial clusters.
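A Single-Pass clustering seeded with the initial cluster centers might look like the sketch below. The similarity measure (cosine), the value of `sim_threshold`, and the choice to keep the centers fixed rather than update them are all assumptions made for illustration; the text does not specify them, and the function name `single_pass` is hypothetical.

```python
import numpy as np

def single_pass(vectors, seed_indices, sim_threshold=0.5):
    """Assign each feature vector to the most similar cluster center,
    or open a new cluster when no center is similar enough."""
    unit = []
    for v in vectors:
        v = np.asarray(v, dtype=float)
        unit.append(v / np.linalg.norm(v))
    centers = [unit[i] for i in seed_indices]   # preselected initial cluster centers
    clusters = [[i] for i in seed_indices]
    for i, v in enumerate(unit):
        if i in seed_indices:
            continue
        sims = [float(np.dot(v, c)) for c in centers]
        best = int(np.argmax(sims)) if sims else -1
        if best >= 0 and sims[best] >= sim_threshold:
            clusters[best].append(i)
        else:
            centers.append(v)                    # the item becomes a new center
            clusters.append([i])
    return clusters
```

Each item is visited exactly once, which is the defining property of the Single-Pass scheme; the result therefore depends on the visiting order and on the seeds.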

So far, a plurality of initial clustering centers are obtained, and a plurality of initial clusters are obtained.

(2) Obtaining newly added corpus data; obtaining the degree of difference between the newly added corpus data and each initial cluster according to the word segments and parts of speech of the newly added corpus data and the initial cluster centers; obtaining, according to the degree of difference, the representative feature value of each word segment in the newly added corpus data for each initial cluster and the relevance between the newly added corpus data and each initial cluster; and obtaining the information loss degree of each word segment in the newly added corpus data according to the relevance and the initial clusters.

It should be noted that, building on the clustering of the initial corpus data, suitable vector data should be extracted from the newly added corpus data so that it can be clustered accurately against each initial cluster center; this avoids the many erroneous clustering results that the unique innovativeness of artistic corpus data causes in a conventional Single-Pass process. The conversion vector of the newly added corpus data is therefore obtained through a neural network from quantified feature values of the newly added corpus data. The feature values underlying the conversion vector are expected to preserve, as far as possible, the degree of difference quantified from the original representing degree while clearly distinguishing the relevance to each cluster; they are obtained by appropriately reducing the loss cost in information-characterization capability according to the relevance judgment, so that an accurate vector value is obtained adaptively.

It should be further noted that, when feature values are obtained from the newly added corpus data to derive the conversion vector, the conversion vector must permit accurate matching so that the influence of the innovativeness of artistic corpus data is avoided. The relevance of the newly added corpus data to each initial cluster is therefore analyzed by quantifying the relevance between the newly added corpus data and the initial cluster center of each initial cluster. Although artistic corpus data is innovative, the information under the same part of speech differs only in modal properties while the topic information it characterizes is essentially the same, so the determination must be made by analyzing the relevance to the clusters. The degree of difference is obtained from the newly added corpus data and the initial cluster centers; the representative feature values and the relevance are then obtained by combining the distribution and parts of speech of the word segments, and the information loss degree is calculated at the same time.

Specifically, the topic characterization degree of each newly added corpus data is obtained according to the above method. Taking any one newly added corpus data and any one initial cluster as an example, the degree of difference between them is calculated as follows:

wherein the terms of the formula denote, respectively: the topic characterization degree of the given newly added corpus data; the maximum topic characterization degree over all corpus data (including the initial corpus data and the newly added corpus data); the number of part-of-speech combinations between the given newly added corpus data and the initial cluster center of the given initial cluster; for each part-of-speech combination, the word vector of the most frequent word segment among all word segments of the newly added corpus data under its part of speech in that combination; for the same part-of-speech combination, the word vector of the most frequent word segment among all word segments of the initial cluster center under its part of speech in that combination; and the cosine similarity of the two vectors. The degree of difference is obtained by a calculation similar to that of the representing degree, except that only the difference between two corpus data is calculated, so the differences are not averaged over multiple combinations; the larger the representativeness and the larger the difference between the newly added corpus data and the initial cluster center, the larger the degree of difference.
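Since the formula image is not reproduced in this text, one reading that is consistent with the definitions above can be written out as follows. Every symbol here is an assumption introduced for illustration: C for the degree of difference, T for the topic characterization degree, M for the number of part-of-speech combinations, and u, v for the two word vectors. The factor 1/2 and the average over part-of-speech combinations are carried over from the description of the representing degree and may differ from the original formula.

```latex
% Hypothetical reconstruction; symbols and the exact combination are assumptions.
C_{b,c} \;=\; \frac{T_b}{T_{\max}} \cdot \frac{1}{2\,M_{b,c}}
\sum_{m=1}^{M_{b,c}} \Bigl( 1 - \cos\bigl(\mathbf{u}_{b,c,m},\, \mathbf{v}_{b,c,m}\bigr) \Bigr)
```

Here b indexes the newly added corpus data and c the initial cluster; unlike the representing degree, no outer average over pairings with other corpus data appears, because only one pair (the newly added corpus data and the cluster center) is compared.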

Further, for any one word segment in the given newly added corpus data, the representing coefficient of the word segment is calculated as follows:

wherein the terms of the formula denote, respectively: the number of all word segments in the given newly added corpus data; the number of occurrences of the given word segment in that newly added corpus data; and the mean of the distribution probabilities of the given word segment over its occurrences, where the distribution probability of an occurrence is the ratio of the position of the word segment in its sentence to the length of the sentence, both measured in word segments (the position is the ordinal number of the word segment within the sentence, and the length is the number of word segments in the sentence). The exponential function with the natural constant as its base is used: this embodiment adopts the exp(-x) model to present the inverse proportional relationship and perform normalization, where x is the input of the model; an implementer may choose another inverse proportional function and normalization function according to the actual situation. The more often a word segment occurs, the stronger its ability to represent the topic information of the newly added corpus data is likely to be; and the smaller its distribution probability, the nearer it lies to the front of the sentence, the better it reflects topic information, and the larger its representing coefficient. The representing coefficients of all word segments in the given newly added corpus data are obtained and linearly normalized, and the result is recorded as the representing weight of each word segment; the product of the representing weight of each word segment and the degree of difference is recorded as the representative feature value of that word segment for the given initial cluster. The degree of difference between each newly added corpus data and each initial cluster is obtained according to this method, giving the representative feature value of each word segment in each newly added corpus data for each initial cluster.
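The steps above can be sketched in a few lines. The sketch assumes that the representing coefficient is the occurrence share multiplied by exp(-mean distribution probability) and that "linear normalization" means min-max scaling; both are readings of the prose rather than confirmed details, and the function names are hypothetical.

```python
import math
from collections import defaultdict

def representative_weights(sentences):
    """sentences: list of sentences, each a list of word segments.
    Returns the min-max normalized representing coefficient per word segment."""
    counts = defaultdict(int)
    probs = defaultdict(list)
    for sent in sentences:
        for pos, word in enumerate(sent, start=1):
            counts[word] += 1
            probs[word].append(pos / len(sent))   # position / sentence length
    total = sum(counts.values())
    coeff = {
        w: (counts[w] / total) * math.exp(-sum(p) / len(p))
        for w, p in probs.items()
    }
    lo, hi = min(coeff.values()), max(coeff.values())
    span = (hi - lo) or 1.0                        # guard against identical coefficients
    return {w: (c - lo) / span for w, c in coeff.items()}

def representative_feature_values(weights, difference_degree):
    """Product of each representing weight and the degree of difference."""
    return {w: wt * difference_degree for w, wt in weights.items()}
```

Note that min-max scaling always assigns weight 0 to the lowest-ranked word segment, so that word segment can never exceed a positive feature threshold.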

Further, a feature threshold is preset (0.6 in this embodiment), and the word segments of each newly added corpus data whose representative feature value for an initial cluster is larger than the feature threshold are recorded as the representative word segments of that newly added corpus data for that initial cluster. Taking any one newly added corpus data and any one initial cluster as an example, the relevance between them is calculated as follows:

wherein the terms of the formula denote, respectively: the number of representative word segments of the given newly added corpus data for the given initial cluster; the representing degree of the initial cluster center of that initial cluster; the representative feature value, for that initial cluster, of each representative word segment of the newly added corpus data; and the absolute value. The exponential function with the natural constant as its base is used: this embodiment adopts the exp(-x) model to present the inverse proportional relationship and perform normalization, where x is the input of the model; an implementer may choose another inverse proportional function and normalization function according to the actual situation. The representative feature value of each representative word segment is compared with the representing degree of the initial cluster center: the closer their ratio is to 1, the greater the relevance between the newly added corpus data and the initial cluster. The relevance between each newly added corpus data and each initial cluster is obtained according to this method.

Further, a contact threshold is preset (0.65 in this embodiment). Taking any one newly added corpus data as an example, the initial clusters whose relevance to the newly added corpus data is larger than the contact threshold are recorded as the contact clusters of the newly added corpus data; the information loss degree of any one word segment in the newly added corpus data is then calculated as follows:

wherein the terms of the formula denote, respectively: the number of initial clusters; the number of contact clusters of the given newly added corpus data; the relevance between the given newly added corpus data and each of its contact clusters; the center-removed relevance between the given newly added corpus data, after the given word segment has been removed, and each contact cluster; and the absolute value. The center-removed relevance is calculated as follows: every occurrence of the given word segment is removed from the newly added corpus data; the degree of difference between the resulting corpus data and the contact cluster is recalculated; the representative feature value of each remaining word segment is obtained; and the relevance after removal is obtained from these and recorded as the center-removed relevance between the newly added corpus data, with the given word segment removed, and the contact cluster. The information loss degree is quantified through the ratio of the center-removed relevance to the relevance: the smaller the change in relevance, the smaller the information loss degree of the word segment. The number of contact clusters also influences the information loss degree: the more contact clusters there are, the more information is lost, so the information loss degree is corrected by the proportion of contact clusters among the initial clusters. The information loss degree of each word segment in each newly added corpus data is obtained according to this method.
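The leave-one-out procedure above can be illustrated as follows. The sketch assumes the information loss degree is the share of contact clusters multiplied by the mean of |1 - center-removed relevance / relevance|; this combination is inferred from the prose and is not confirmed by the missing formula. The relevance computation itself is passed in as a hypothetical callable `relevance_fn(tokens, cluster)`, so that the sketch stays independent of how relevance is defined.

```python
def information_loss(tokens, clusters, relevance_fn, contact_threshold=0.65):
    """tokens: word segments of one newly added corpus item (with repeats).
    clusters: identifiers of the initial clusters.
    Returns {word segment: information loss degree}."""
    base = {c: relevance_fn(tokens, c) for c in clusters}
    contact = [c for c in clusters if base[c] > contact_threshold]
    if not contact:
        return {}
    share = len(contact) / len(clusters)           # proportion of contact clusters
    loss = {}
    for word in set(tokens):
        reduced = [t for t in tokens if t != word]  # drop every occurrence
        changes = [
            abs(1.0 - relevance_fn(reduced, c) / base[c]) for c in contact
        ]
        loss[word] = share * sum(changes) / len(changes)
    return loss
```

Because each word segment requires the relevance to be recomputed for every contact cluster, the cost grows with the product of vocabulary size and the number of contact clusters.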

Thus, the representative feature value and the information loss degree of each word in the newly-added corpus data and the relation between the newly-added corpus data and each initial cluster are obtained.

(3) Correcting the representative feature value of each word in the newly-added corpus data according to the information loss degree to obtain a corrected feature value, acquiring a conversion vector of the newly-added corpus data according to the corrected feature value, and performing Single-Pass clustering.

After the information loss degrees of the word segments are obtained, the word segments whose representative feature values need correcting are determined from the set of word segments with smaller information loss degree and the representative word segments of each initial cluster. The word segments in the intersection of this set and the representative word segments can influence the classification result among the initial clusters, so their feature values must be corrected to ensure the accuracy of the clustering result and to reduce the possibility of inaccurate clustering caused by repeated occurrences.

Specifically, taking any one newly added corpus data as an example, multiple occurrences of the same word segment are grouped into one word segment category, so that each word segment category contains a number of identical word segments, and all word segments in the same category have the same information loss degree and the same representative feature value. A loss number is preset (10 in this embodiment). All word segment categories in the newly added corpus data are arranged in ascending order of their information loss degree, and the resulting sequence is recorded as the word segment loss sequence of the newly added corpus data; all word segments contained in the first loss-number categories of the word segment loss sequence form a set, recorded as the loss word segment set of the newly added corpus data. Meanwhile, all representative word segments of the newly added corpus data for any one initial cluster form a set, recorded as the representative word segment set of the newly added corpus data for that initial cluster, and the representative word segment sets of the newly added corpus data for all initial clusters are obtained. The intersection common to all representative word segment sets and the loss word segment set is acquired, and all word segments in the intersection are recorded as the word segments to be corrected of the newly added corpus data. The corrected feature value, for any one initial cluster, of any one word segment to be corrected in the newly added corpus data is calculated as follows:
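The selection of word segments to be corrected reduces to a sort followed by a set intersection, as in this small sketch. The function name `words_to_correct` and the dictionary layout of the inputs are assumptions for illustration.

```python
def words_to_correct(loss_by_word, representative_sets, loss_count=10):
    """loss_by_word: {word segment category: information loss degree}.
    representative_sets: one set of representative word segments per initial cluster.
    Returns the word segments present in the loss set and in every representative set."""
    ordered = sorted(loss_by_word, key=loss_by_word.get)   # ascending loss
    loss_set = set(ordered[:loss_count])                   # the first loss_count categories
    return set.intersection(loss_set, *representative_sets)
```

For example, with losses `{"art": 0.6, "paint": 0.1, "music": 0.3, "line": 0.2}`, a loss number of 3, and representative sets `{"paint", "music", "art"}` and `{"paint", "music"}`, the word segments to be corrected are `paint` and `music`.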

wherein the terms of the formula denote, respectively: the number of word segments to be corrected in the given newly added corpus data; the number of occurrences of the given word segment to be corrected among all word segments to be corrected; and the representative feature value of the given word segment to be corrected for the given initial cluster. The more often a word segment is repeated, the smaller its representative feature value must be made, so that it does not distort the clustering result and cause the newly added corpus data to be clustered incorrectly. The corrected feature value of each word segment to be corrected for each initial cluster in each newly added corpus data is obtained according to this method; for the word segments in the newly added corpus data other than the word segments to be corrected, the representative feature value is used as the corrected feature value, i.e. their representative feature values need no correction.

Further, taking any one newly added corpus data and any one initial cluster as an example, the corrected feature values of all word segments in the newly added corpus data for the initial cluster are arranged in word segment order, and the resulting sequence is recorded as the feature value sequence of the newly added corpus data for the initial cluster; the feature value sequence of each newly added corpus data for each initial cluster is obtained according to this method. A neural network is constructed to obtain the conversion vector from the feature value sequence of the newly added corpus data. The neural network is a CNN, the loss function is the root mean square error function, and the training data set consists of all newly added corpus data, with the feature value sequence of each newly added corpus data serving as its label. The output data is, for each initial cluster, a vector corresponding to the feature value sequence; the vector is obtained through Word2Vec model conversion, and its number of elements is set to 100, i.e. equal to the number of elements of the feature vector of each initial corpus data in the initial clusters. The obtained vector is recorded as the conversion vector of the newly added corpus data for the initial cluster, and the trained neural network is thus obtained. The newly added corpus data, together with the corresponding feature value sequence, is input into the trained neural network, which outputs the conversion vector of the newly added corpus data for the initial cluster; Single-Pass clustering is then performed according to the conversion vectors of the newly added corpus data for the initial clusters, completing the clustering of the newly added corpus data.

So far, the conversion vector of the new corpus data to the initial clustering is obtained, and further, the Single-Pass clustering is continuously carried out on the new corpus data.

And the artistic corpus construction module 103 completes construction of the artistic corpus according to the clustering result.

After all newly added corpus data have been clustered, the clustering result is obtained as a plurality of clusters. The texts are grouped according to the clustering result, a folder or label is created for each cluster, and the corpus data belonging to a cluster are stored in the corresponding folder. The artistic corpus is then constructed on the basis of the clusters, so that artistic data on a specific topic or content can easily be searched and accessed as needed. When further newly added corpus data is obtained, it can be clustered according to the above method and placed in the corresponding folder according to the clustering result, thereby updating the artistic corpus.
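The folder-per-cluster layout described above could be realized along the following lines. The directory and file naming scheme (`cluster_000`, `corpus_00000.txt`) and the function name `store_clusters` are purely illustrative assumptions; the text only requires that each cluster have its own folder or label.

```python
from pathlib import Path

def store_clusters(clusters, texts, root):
    """clusters: list of lists of corpus indices; texts: corpus texts by index.
    Writes each corpus text into the folder of the cluster it belongs to."""
    root = Path(root)
    for cluster_id, members in enumerate(clusters):
        folder = root / f"cluster_{cluster_id:03d}"
        folder.mkdir(parents=True, exist_ok=True)   # reuse the folder on later updates
        for idx in members:
            (folder / f"corpus_{idx:05d}.txt").write_text(texts[idx], encoding="utf-8")
```

Calling the function again after new corpus data has been clustered adds files to the existing folders, which matches the update step described above.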

Thus, the construction of the artistic corpus of the multi-modal data is completed.

The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims

1. An artistic multi-modal corpus construction system based on data, the system comprising:

The artistic corpus construction module is used for completing construction of the artistic corpus according to the clustering result;

the method for acquiring the plurality of initial clustering centers according to the representing degree of each initial corpus data comprises the following specific steps:

wherein the terms of the formula denote, respectively: the topic characterization degree of the given initial corpus data; the maximum topic characterization degree over all initial corpus data; the number of combinations formed by pairing the given initial corpus data with each other initial corpus data; the number of part-of-speech combinations in each such combination, said part-of-speech combinations each pairing one part of speech from each of the two initial corpus data; for each part-of-speech combination, the word vector of the most frequent word segment among all word segments of the given initial corpus data under its part of speech in that combination; for the same part-of-speech combination, the word vector of the most frequent word segment among all word segments of the other initial corpus data under its part of speech in that combination; and the cosine similarity of the two vectors;

obtaining the representing degree of each initial corpus data, and taking the initial corpus data with the representing degree larger than a representing threshold value as initial clustering centers to obtain a plurality of initial clustering centers;

the difference degree between each new corpus data and each initial cluster is obtained by the specific method:

wherein the terms of the formula denote, respectively: the topic characterization degree of the given newly added corpus data; the maximum topic characterization degree over all corpus data; the number of part-of-speech combinations between the given newly added corpus data and the initial cluster center of the given initial cluster; for each part-of-speech combination, the word vector of the most frequent word segment among all word segments of the newly added corpus data under its part of speech in that combination; and, for the same part-of-speech combination, the word vector of the most frequent word segment among all word segments of the initial cluster center under its part of speech in that combination;

obtaining the difference degree of each new corpus data and each initial cluster;

the specific method for obtaining the representative characteristic value of each word segment in each new corpus data for each initial cluster comprises the following steps:

wherein the terms of the formula denote, respectively: the representing coefficient of the given word segment in the given newly added corpus data; the number of all word segments in that newly added corpus data; the number of occurrences of the given word segment in that newly added corpus data; the mean of the distribution probabilities of the given word segment over its occurrences, the distribution probability being the ratio of the position of the word segment in its sentence at each occurrence to the length of the sentence; and the exponential function with the natural constant as its base;

obtaining the representing coefficients of all word segments in the given newly added corpus data, linearly normalizing the representing coefficients of all word segments, and recording the result as the representing weight of each word segment; recording the product of the representing weight of each word segment and the degree of difference as the representative feature value of that word segment in the given newly added corpus data for the given initial cluster; and obtaining the representative feature value of each word segment in each newly added corpus data for each initial cluster.

2. The system for constructing a multi-modal corpus based on data according to claim 1, wherein the topic representation degree of each initial corpus data is obtained by the following specific method:

and obtaining the subject characterization degree of each initial corpus data.

3. The system for constructing a data-based artistic multimodal corpus according to claim 1, wherein the obtaining of the plurality of initial clusters comprises the following specific steps:

4. The system for constructing a multi-modal corpus of art based on data according to claim 1, wherein the method for obtaining the relevance of each new corpus data to each initial cluster and the information loss degree of each word segment in each new corpus data comprises the following specific steps:

recording the word segments of each newly added corpus data whose representative feature value for each initial cluster is larger than the feature threshold as the representative word segments of each newly added corpus data for each initial cluster; the relevance between any one newly added corpus data and any one initial cluster is calculated as follows:

5. The system for constructing a multi-modal corpus based on data as claimed in claim 4, wherein the information loss degree of each word segment in each new corpus data is obtained by the following specific method:

for any one newly added corpus data, recording the initial clusters whose relevance to the newly added corpus data is larger than the contact threshold as the contact clusters of the newly added corpus data; the information loss degree of any one word segment in the newly added corpus data is calculated as follows:

wherein the terms of the formula denote, respectively: the number of initial clusters; the number of contact clusters of the given newly added corpus data; the relevance between the given newly added corpus data and each contact cluster; and the center-removed relevance between the given newly added corpus data, after the given word segment has been removed, and each contact cluster;

the center-removed relevance is calculated as follows: removing every occurrence of the given word segment from the given newly added corpus data; recalculating the degree of difference between the newly added corpus data after removal and the given contact cluster; obtaining the representative feature value of each word segment after removal; obtaining the relevance after removal; and recording the relevance after removal as the center-removed relevance between the given newly added corpus data, with the given word segment removed, and the given contact cluster;

6. The system for constructing a data-based artistic multimodal corpus according to claim 4, wherein the method for obtaining the corrected feature value of each word segment for each initial cluster in each new corpus data comprises the following steps:

grouping multiple occurrences of the same word segment into one word segment category, wherein all word segments in the same word segment category have the same information loss degree and the same representative feature value; arranging all word segment categories in the given newly added corpus data in ascending order of their information loss degree, and recording the resulting sequence as the word segment loss sequence of the newly added corpus data; forming a set from all word segments contained in the first loss-number word segment categories of the word segment loss sequence, and recording the set as the loss word segment set of the newly added corpus data;

forming a set from all representative word segments of the given newly added corpus data for any one initial cluster, and recording the set as the representative word segment set of the newly added corpus data for that initial cluster; obtaining the representative word segment sets of the newly added corpus data for all initial clusters; acquiring the intersection common to all representative word segment sets and the loss word segment set, and recording all word segments in the intersection as the word segments to be corrected of the newly added corpus data;

wherein the terms of the formula denote, respectively: the number of word segments to be corrected in the given newly added corpus data; the number of occurrences of the given word segment to be corrected among all word segments to be corrected; and the representative feature value of the given word segment to be corrected for the given initial cluster;

7. The system for constructing a data-based artistic multimodal corpus according to claim 1, wherein the method for obtaining the conversion vector of each new corpus data for each initial cluster according to the corrected feature value comprises the following specific steps:

for any one newly added corpus data and any one initial cluster, arranging the corrected feature values of all word segments in the newly added corpus data for the initial cluster in word segment order, and recording the resulting sequence as the feature value sequence of the newly added corpus data for the initial cluster; obtaining the feature value sequence of each newly added corpus data for each initial cluster;