CN107506415B - Content-based high-order semantic tensorization classification method and system for large texts - Google Patents

Content-based high-order semantic tensorization classification method and system for large texts

Info

Publication number
CN107506415B
CN107506415B (application CN201710687437.0A)
Authority
CN
China
Prior art keywords
tensor
dec
vector
class
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710687437.0A
Other languages
Chinese (zh)
Other versions
CN107506415A (en)
Inventor
谭培波
史晓凌
茹海燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhitong Yunlian Technology Co Ltd
Original Assignee
Beijing Zhitong Yunlian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhitong Yunlian Technology Co Ltd filed Critical Beijing Zhitong Yunlian Technology Co Ltd
Priority to CN201710687437.0A priority Critical patent/CN107506415B/en
Publication of CN107506415A publication Critical patent/CN107506415A/en
Application granted granted Critical
Publication of CN107506415B publication Critical patent/CN107506415B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a content-based high-order semantic tensorization classification method for large texts, comprising the following steps: step one, constructing a DEC tensor model for each class; step two, acquiring part of the text of the large text to be classified, constructing the DEC tensor of the large text from that part, logically multiplying the DEC tensor by the feature tensor of each class, performing a full dimension-reduction addition over the 3 DEC dimensions to obtain the strength with which the large text belongs to each class, and displaying the output result. The invention also discloses a content-based high-order semantic tensorization classification system for large texts, comprising: a basic corpus layer for storing the DEC tensor model elements and the corpus samples required for model processing; a DEC semantic processing layer for performing word segmentation and DEC tensorization of the large text and for computing and invoking the tensor model; and an application layer for receiving text input by the user and displaying the classification result. The invention resolves the contradiction between limited computing resources and text-understanding accuracy.

Description

Content-based high-order semantic tensorization classification method and system for large texts
Technical Field
The invention belongs to the technical field of text classification, and relates to a content-based high-order semantic tensorization classification method and system for large texts.
Background
With the development of the internet, a great deal of knowledge appears in online documents. But web documents are typically short texts, on the scale of no more than one A4 page. Domestic scientific papers, for example those on CNKI, generally run about 5 pages, and doctoral dissertations about 60-100 pages. Field-oriented scientific research reports, however, generally run to about 300 pages and roughly 100,000 words; they are usually richly illustrated, stored mainly in PDF format, and must be converted from PDF to txt, a conversion that introduces many garbled characters that interfere with the classification accuracy of research-report documents.
The traditional classification method based on inter-sentence similarity requires similarity calculations over tens of thousands of sentences; the computation is heavy and cannot meet the processing-speed requirements of engineering projects. Other methods classify according to a bag-of-words model, but their accuracy falls short because they lack an understanding of the text's semantics.
Disclosure of Invention
An object of the present invention is to solve at least the above problems and/or disadvantages and to provide at least the advantages described hereinafter.
It is still another object of the present invention to provide a content-based high-order semantic tensorization classification method for large texts.
It is still another object of the present invention to provide a content-based high-order semantic tensorization classification system for large texts.
Therefore, the technical scheme provided by the invention is as follows:
a high-order semantic tensorial classification method and a system for large texts based on contents comprise the following steps:
step one, constructing a DEC tensor model of a class:
1.1) perform 3-level domain word segmentation on a plurality of large texts, and establish the correspondence between the word set of each large text and its class;
2.1) construct the DEC tensor of each large text: segment each large text corresponding to each class to obtain the word set D representing the Domain of the large text, the word set C representing the business-activity Concept, and the word set E representing the related Elements;
2.2) first obtain the independent one-dimensional vector of E, then expand it into the CE tensor;
2.3) expand the CE tensor into the DEC tensor;
2.4) loop over steps 2.1) to 2.3) to complete the tensor construction of the whole class;
3.1) select any class and add the tensors of the remaining classes to obtain the counterexample tensor of that class;
3.2) subtract the counterexample tensor of the class from the total tensor of the class to obtain the feature tensor of the class;
step two, acquire part of the text of the large text to be classified; first construct the DEC tensor of the large text from that part according to steps 2.1) to 2.3), then load the class feature tensors from step one, logically multiply the DEC tensor of the large text to be classified by the feature tensor of each class, perform a full dimension-reduction addition over the 3 DEC dimensions of each logically multiplied class tensor to obtain the strength with which the large text belongs to that class, and finally display the output result.
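The logical multiplication and full dimension-reduction addition of step two can be sketched as follows, assuming (as an illustration, not something the patent specifies) that DEC tensors are represented as NumPy boolean arrays over the three DEC axes; all shapes and index positions below are hypothetical:

```python
import numpy as np

# Hypothetical DEC tensor of a document and feature tensor of one class,
# both boolean arrays over the (D, E, C) index positions.
doc_tensor = np.zeros((3, 4, 2), dtype=bool)
doc_tensor[0, 1, 0] = True
doc_tensor[2, 3, 1] = True

class_feature = np.zeros((3, 4, 2), dtype=bool)
class_feature[0, 1, 0] = True   # position shared with the document
class_feature[1, 2, 0] = True   # position not present in the document

# Logical multiplication (element-wise AND), then dimension-reduction
# addition over all 3 DEC axes, yields the membership strength.
strength = int(np.logical_and(doc_tensor, class_feature).sum())
print(strength)  # 1 overlapping DEC position
```

Repeating this for every class and comparing the strengths gives the classification decision described in step two.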
Preferably, in the content-based high-order semantic tensorization classification method for large texts, when constructing the DEC tensor model of a class in step one, the method further includes the following steps:
first, select a plurality of words representing the Domain within the field as the domain word set and establish the D table; select a plurality of words representing the business-activity Concept as the business-activity verb set and establish the C table;
in step 2.1), when segmenting each large text, first read the D table and the C table, then remove the D-table and C-table entries from the large text's word set, and combine the remaining words into the word set E to establish the E table.
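A minimal sketch of this E-set construction, using hypothetical D-table, C-table, and segmentation contents (the real tables are comma-separated domain vocabularies curated by experts):

```python
# Hypothetical table contents; in the patent these are comma-separated
# text files ordered by importance.
d_table = ["Ordos Basin", "carbonate series", "natural gas"]
c_table = ["research", "optimization"]

# Hypothetical word-segmentation result of one large text.
segmented = ["Ordos Basin", "structure", "evolution", "research",
             "natural gas", "accumulation history"]

# Remove D-table and C-table entries; the remaining words form the E set,
# preserving first-occurrence order in the text.
known = set(d_table) | set(c_table)
e_set = []
for word in segmented:
    if word not in known and word not in e_set:
        e_set.append(word)
print(e_set)
```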
Preferably, in the content-based high-order semantic tensorization classification method for large texts, in step 2.2), the specific method of first obtaining the independent one-dimensional vector of E and then expanding it into the CE tensor includes:
first, sort the D and C word sets of the large text according to the D table and the C table to construct their separate one-dimensional vectors, and sort the E word set according to the E table to obtain the independent one-dimensional vector of E;
then, construct a 0 vector of the same size as the E vector, i.e., a vector whose every element is 0, and combine the 0 vector and the E vector row-wise into a 0 → E vector pair; repeat this vector pair n times row-wise to obtain the candidate set of the CE tensor, where n is the dimensionality of the C vector;
finally, obtain the partition ordinal vector of the 0 → E pairs in the CE candidate set from the size of the C vector, and add it to the C vector to obtain the selection ordinals; that is, according to each C value, select either the 0 vector or the E vector from the candidate set, completing the expansion from the E vector to the CE tensor.
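Under the assumption (not stated in the patent) that the E and C vectors are 0/1 occurrence vectors, the 0 → E candidate-pair selection above can be sketched like this; the function name and vector values are hypothetical:

```python
import numpy as np

def expand(selector, base):
    """For each selector entry, pick the base vector when the entry is
    nonzero, else the 0 vector - mimicking the 0 -> E pair selection."""
    pair = np.stack([np.zeros_like(base), base])   # row 0 -> 0 vector, row 1 -> base
    idx = (np.asarray(selector) != 0).astype(int)  # selection ordinal per C value
    return pair[idx]                               # shape (len(selector), len(base))

e_vec = np.array([1, 0, 1, 1])   # hypothetical E occurrence vector
c_vec = np.array([1, 0, 1])      # hypothetical C occurrence vector
ce = expand(c_vec, e_vec)        # CE tensor of shape (3, 4)
print(ce)
```

Rows of `ce` where C is 0 stay as 0 vectors; rows where C is nonzero carry a copy of the E vector.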
Preferably, in the content-based high-order semantic tensorization classification method for large texts, in step 2.3), the specific method for expanding the CE tensor into the DEC tensor includes:
flatten the CE tensor into a first-order vector, construct a 0 vector of the same dimensionality, and form a 0 → CE vector pair;
expand the 0 → CE vector pair n times row-wise to form the 0 → CE candidate set, where n is the dimensionality of the D vector;
determine the ordinal vector of the 0 → CE pairs from the dimensionality of the D vector, and add it to the D vector to obtain the selection ordinals of the DEC tensor;
perform the selection to obtain the DEC tensor of the large text.
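The CE-to-DEC expansion mirrors the E-to-CE step: flatten CE, gate it by the D vector, and reshape. A sketch under the same hypothetical 0/1-vector assumption (names and values are illustrative only):

```python
import numpy as np

def expand(selector, base):
    # Select the base vector or the 0 vector per selector entry.
    pair = np.stack([np.zeros_like(base), base])
    idx = (np.asarray(selector) != 0).astype(int)
    return pair[idx]

c_vec = np.array([1, 0, 1])      # hypothetical C occurrence vector
e_vec = np.array([1, 0, 1, 1])   # hypothetical E occurrence vector
d_vec = np.array([0, 1])         # hypothetical D occurrence vector

ce = expand(c_vec, e_vec)                  # CE tensor, shape (3, 4)
flat = ce.ravel()                          # flatten to a first-order vector
dec = expand(d_vec, flat)                  # 0 -> CE selection driven by D
dec = dec.reshape(len(d_vec), *ce.shape)   # DEC tensor, shape (2, 3, 4)
print(dec.shape)
```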
Preferably, in the content-based high-order semantic tensorization classification method for large texts, in step 1.1), large texts of different formats are first converted into txt files; the word set of each large text may correspond to multiple classes.
Preferably, in the content-based high-order semantic tensorization classification method for large texts, after step 3.2) and before step two, the method further includes the following step:
3.3) convert the feature tensor of the class into a JSON dictionary format suitable for invocation, and output it as the input loaded in step two.
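One possible JSON dictionary layout for step 3.3) — the patent does not specify the schema, so the class name and the sparse coordinate encoding here are assumptions — stores only the nonzero DEC coordinates of each feature tensor:

```python
import json
import numpy as np

# Hypothetical class feature tensor (boolean over D x E x C positions).
feature = np.zeros((2, 3, 4), dtype=bool)
feature[0, 1, 2] = True
feature[1, 0, 3] = True

# Store only the nonzero DEC coordinates per class, keeping the JSON
# dictionary compact and easy to reload before classification.
model = {"reservoir_geology": [list(map(int, idx)) for idx in zip(*feature.nonzero())]}
payload = json.dumps(model)

# Reloading restores the tensor for the logical-multiplication step.
restored = np.zeros_like(feature)
for i, j, k in json.loads(payload)["reservoir_geology"]:
    restored[i, j, k] = True
assert (restored == feature).all()
```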
A content-based high-order semantic tensorization classification system for large texts comprises:
a basic corpus layer for storing the elements of the DEC tensor model and the corpus samples required for processing it, the corpus samples comprising the word-segmentation results of the large documents, the large-document names, and the correspondences between the large documents and the classes;
a DEC semantic processing layer, communicatively connected to the basic corpus layer, for performing word segmentation and DEC tensorization of the large text and for computing and invoking the tensor model;
and the application layer is in communication connection with the DEC semantic processing layer and is used for receiving the text input by the user and displaying the classification result.
Preferably, in the content-based high-order semantic tensorization classification system for large texts, the basic corpus layer includes:
a model element module, connected to the DEC processing module, comprising a D table, a C table, and a classification structure tree; the D table contains a plurality of words representing domain entity objects, ordered by importance; the C table contains a plurality of words representing business activities, ordered by the importance of those activities; and the classification structure tree is a business-system knowledge structure describing the business logic;
and the corpus classifying module is used for storing the corpus samples required by the processing of the DEC tensor model.
Preferably, in the content-based high-order semantic tensorization classification system for large texts, the DEC semantic processing layer includes:
a 3-level word segmentation module that progressively increases segmentation granularity in the order of 2-3-character words, 4-character words, and words of 5 or more characters;
a DEC tensor model calculation and invocation module comprising a DEC processing module, a classification-model calculation module, and a classification-model invocation module; the DEC processing module is communicatively connected to the basic corpus layer and performs the tensorization of each large text, ordering the words and carrying out tensor operations on the DEC as a whole in the order specified by DEC so as to reconstruct the text; the classification-model calculation module constructs the unique feature tensor model of each class; and the classification-model invocation module is communicatively connected to the application layer and, at application time, calls the computed classification tensor model to calculate the position distribution of a newly added text in the DEC space and obtain the output classification result.
Preferably, in the content-based high-order semantic tensorization classification system for large texts, the application layer includes:
a text receiving module comprising an editable input text box and a receive button, communicatively connected to the DEC processing module;
and a classification result display module comprising a display label widget for presenting the results.
The invention at least comprises the following beneficial effects:
the invention recovers the semantics of the text level on the basis of words through the DEC semantic tensor model, embodies the content of the text to the maximum extent, has high speed and high accuracy of content classification, and overcomes the difficulties of large calculation amount based on sentences and difficult construction based on sections.
The method establishes the text-level DEC tensor model on the basis of word segmentation, does not analyze sentences and calculate the similarity between the sentences, avoids the waste of computing resources caused by large computation amount, has high speed, does not lose the grasp of the whole text semantics, and solves the contradiction between insufficient computing resources and text understanding accuracy.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention.
Drawings
FIG. 1 is a schematic diagram of a DEC tensor model of classes as described in one embodiment of the present invention;
FIG. 2 is a diagram of a content-based high-order semantic tensorization classification system for large texts according to an embodiment of the present invention;
FIG. 3 is a flow chart of high-order semantic tensor modeling of large text based on content in one embodiment of the present invention;
FIG. 4 is a flowchart of a content-based calling process of a high-level semantic tensor model for large texts in an embodiment of the present invention.
Detailed Description
The present invention is further described in detail below with reference to the attached drawings so that those skilled in the art can implement the invention by referring to the description text.
It will be understood that terms such as "having," "including," and "comprising," as used herein, do not preclude the presence or addition of one or more other elements or groups thereof.
As shown in FIGS. 1 to 4, the present invention provides a content-based high-order semantic tensorization classification method for large texts, which includes the following steps:
step one, constructing a DEC tensor model of a class:
1.1) perform 3-level domain word segmentation on a plurality of large texts, and establish the correspondence between the word set of each large text and its class;
2.1) construct the DEC tensor of each large text: segment each large text corresponding to each class to obtain the word set D representing the Domain of the large text, the word set C representing the business-activity Concept, and the word set E representing the related Elements;
2.2) first obtain the independent one-dimensional vector of E, then expand it into the CE tensor;
2.3) expand the CE tensor into the DEC tensor;
2.4) loop over steps 2.1) to 2.3) to complete the tensor construction of the whole class;
3.1) select any class and add the tensors of the remaining classes to obtain the counterexample tensor of that class;
3.2) subtract the counterexample tensor of the class from the total tensor of the class to obtain the feature tensor of the class;
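With boolean DEC tensors, "adding" the remaining classes in step 3.1) corresponds to a logical OR, and "subtracting" the counterexample tensor in step 3.2) to a logical AND-NOT; a two-class sketch with hypothetical toy tensors (representation and shapes are assumptions for illustration):

```python
import numpy as np

# Hypothetical per-class total DEC tensors (two classes, tiny 2x3 slices).
class_a = np.array([[1, 1, 0], [0, 1, 0]], dtype=bool)
class_b = np.array([[0, 1, 0], [0, 0, 1]], dtype=bool)

# Counterexample tensor of class A: the accumulation (logical OR) of every
# other class - here only class B exists.
counter_a = class_b

# Subtracting the counterexample leaves only the positions unique to A.
feature_a = np.logical_and(class_a, np.logical_not(counter_a))
print(feature_a.astype(int))
```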
step two, acquire part of the text of the large text to be classified; first construct the DEC tensor of the large text from that part according to steps 2.1) to 2.3), then load the class feature tensors from step one, logically multiply the DEC tensor of the large text to be classified by the feature tensor of each class, perform a full dimension-reduction addition over the 3 DEC dimensions of each logically multiplied class tensor to obtain the strength with which the large text belongs to that class, and finally display the output result.
In the above aspect, preferably, when constructing the DEC tensor model of the class in step one, the method further includes the following steps:
first, select a plurality of words representing the Domain within the field as the domain word set and establish the D table; select a plurality of words representing the business-activity Concept as the business-activity verb set and establish the C table;
in step 2.1), when segmenting each large text, first read the D table and the C table, then remove the D-table and C-table entries from the large text's word set, and combine the remaining words into the word set E to establish the E table.
In the above scheme, preferably, in step 2.2), the specific method of first obtaining the independent one-dimensional vector of E and then expanding it into the CE tensor includes:
first, sort the D and C word sets of the large text according to the D table and the C table to construct their separate one-dimensional vectors, and sort the E word set according to the E table to obtain the independent one-dimensional vector of E;
then, construct a 0 vector of the same size as the E vector, i.e., a vector whose every element is 0, and combine the 0 vector and the E vector row-wise into a 0 → E vector pair; repeat this vector pair n times row-wise to obtain the candidate set of the CE tensor, where n is the dimensionality of the C vector;
finally, obtain the partition ordinal vector of the 0 → E pairs in the CE candidate set from the size of the C vector, and add it to the C vector to obtain the selection ordinals; that is, according to each C value, select either the 0 vector or the E vector from the candidate set, completing the expansion from the E vector to the CE tensor.
In the above aspect, preferably, in step 2.3), the specific method for expanding the CE tensor into the DEC tensor includes:
flatten the CE tensor into a first-order vector, construct a 0 vector of the same dimensionality, and form a 0 → CE vector pair;
expand the 0 → CE vector pair n times row-wise to form the 0 → CE candidate set, where n is the dimensionality of the D vector;
determine the ordinal vector of the 0 → CE pairs from the dimensionality of the D vector, and add it to the D vector to obtain the selection ordinals of the DEC tensor;
perform the selection to obtain the DEC tensor of the large text.
In one embodiment of the present invention, preferably, in step 1.1), large texts of different formats are first converted into txt files; the word set of each large text may correspond to multiple classes.
In one embodiment of the present invention, preferably, after step 3.2) and before step two, the following step is further included:
3.3) convert the feature tensor of the class into a JSON dictionary format suitable for invocation, and output it as the input loaded in step two.
A content-based high-order semantic tensorization classification system for large texts comprises:
a basic corpus layer 1 for storing the elements of the DEC tensor model and the corpus samples required for processing it, the corpus samples comprising the word-segmentation results of the large documents, the large-document names, and the correspondences between the large documents and the classes;
a DEC semantic processing layer 2, communicatively connected to the basic corpus layer, for performing word segmentation and DEC tensorization of the large text and for computing and invoking the tensor model;
and the application layer 3 is in communication connection with the DEC semantic processing layer and is used for receiving the text input by the user and displaying the classification result.
In one embodiment of the present invention, the basic corpus layer 1 preferably includes:
a model element module 110, connected to the DEC processing module, comprising a D table 111, a C table 112, and a classification structure tree 113; the D table 111 contains a plurality of words representing domain entity objects, ordered by importance; the C table 112 contains a plurality of words representing business activities, ordered by the importance of those activities; and the classification structure tree 113 is a business-system knowledge structure describing the business logic;
and a corpus classifying module 120, configured to store the corpus samples required by the processing of the DEC tensor model.
In one embodiment of the present invention, preferably, the DEC semantic processing layer 2 includes:
a 3-level word segmentation module 210 that progressively increases segmentation granularity in the order of 2-3-character words, 4-character words, and words of 5 or more characters;
a DEC tensor model calculation and invocation module 220 comprising a DEC processing module 221, a classification-model calculation module 222, and a classification-model invocation module 223; the DEC processing module is communicatively connected to the basic corpus layer and performs the tensorization of each large text, ordering the words and carrying out tensor operations on the DEC as a whole in the order specified by DEC so as to reconstruct the text; the classification-model calculation module constructs the unique feature tensor model of each class; and the classification-model invocation module is communicatively connected to the application layer and, at application time, calls the computed classification tensor model to calculate the position distribution of a newly added text in the DEC space and obtain the output classification result.
In one embodiment of the present invention, preferably, the application layer 3 includes:
a text accepting module 310 comprising an editable input text box and a receive button, said text accepting module communicatively coupled to said DEC processing module;
a classification result display module 320 comprising a display label widget for presenting the results.
In order that those skilled in the art will better understand the present invention, the following examples are now provided for illustration:
the invention provides a method and a system for semantic tensor classification of a large text based on contents, as shown in fig. 1, the main method is to understand the contents of the whole document as that people carry out certain research (Concept) on related elements (Element) of a certain field (Domain), for example, "the evolution of the structure of an Orthos basin and the research of the natural gas accumulation history of a carbonate rock layer system of the next ancient world" can be understood as that an author carries out C (research and the research of the accumulation history) on related elements E (structure evolution and the accumulation history) of D (Orthos basin, carbonate rock layer system and natural gas). Matching with E obtained from a pre-literature, and performing DEC independent sequencing according to an important order to construct a basic frame of the DEC three-dimensional model; carrying out vectorization modeling on each large document according to three dimensions of DEC, wherein each document is a point in the DEC three-dimensional space; all documents contained in each class are a region in this DEC three-dimensional space. The regions of all documents belonging to each class are positive example regions, the regions of the documents belonging to other classes are negative example regions, and the spaces of the positive example regions and the negative example regions are crossed; the positive example area excludes all negative example areas, and the rest is the unique characteristic area belonging to the class; the unique regions of all classes are combined to form the characteristic tensor model of the whole classification system. 
When the model is applied, DEC tensorization is first performed on the document to obtain its distribution in the DEC space; then the number of the document's distribution points falling within each class's unique region is determined, i.e., the strength with which the document belongs to that class; finally, the class strengths are sorted to obtain the classification result for the document.
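This invocation flow can be sketched with three hypothetical class feature tensors over a toy 2x2 DEC slice; the class names and all values are invented for illustration, not taken from the patent:

```python
import numpy as np

# Hypothetical unique-feature tensors of three classes over the same space.
models = {
    "geology":  np.array([[1, 0], [1, 1]], dtype=bool),
    "drilling": np.array([[0, 1], [0, 1]], dtype=bool),
    "logging":  np.array([[0, 0], [1, 0]], dtype=bool),
}
# DEC tensor of the newly added document.
doc = np.array([[1, 0], [1, 1]], dtype=bool)

# Strength = count of DEC points the document shares with each class's
# unique region; rank the classes by strength to classify.
strengths = {name: int(np.logical_and(doc, feat).sum())
             for name, feat in models.items()}
ranking = sorted(strengths, key=strengths.get, reverse=True)
print(ranking[0])  # the predicted class
```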
The method establishes a text-level DEC tensor model on the basis of word segmentation, without parsing sentences or computing inter-sentence similarity; it avoids the waste of computing resources caused by heavy computation and is fast, yet does not lose its grasp of the overall semantics of the text, thereby resolving the contradiction between limited computing resources and text-understanding accuracy.
The high-order semantic tensorization classification system for large texts is shown in FIG. 2. Logically, the system is divided into a basic corpus layer 1, a DEC semantic processing layer 2, and an application layer 3. The basic corpus layer stores the model elements and all corpus samples required for model processing; the DEC semantic processing layer performs DEC tensorization of the text and computes and invokes the model; and the application layer receives the user's input text and displays the final classification result.
The basic corpus layer consists of a model element module 110 and a classified corpus module 120. The model element module 110 is composed of a D table 111, a C table 112, and a classification structure tree 113. The D table 111 is a group of words representing the domain entity objects, ordered by importance and stored as a comma-separated text file, as shown in Table 2; the C table 112 is a set of words associated with the business activities, ordered by the importance of each activity and likewise stored as a comma-separated text file, as shown in Table 1; the classification structure tree 113 is generally a stable business system reviewed and confirmed by an authority, a tree-structured body of business-system knowledge describing business logic common to different companies in the same field, as shown in Table 3. The classified corpus module 120 is a file storage system that stores, under each document's file name and class, the word-segmentation result of that document after segmentation and cleaning, as shown in Table 4; this table describes the factor-result relationships of the samples, and all model parameters are computed from it.
TABLE 1: C table
optimization technique, history research, prospect
TABLE 2: D table
Ordos Basin, interval, carbonate sequence, platform margin
(Tables 3 and 4 are reproduced only as images in the original publication.)
The DEC semantic processing layer 2 is composed of a 3-level word segmentation module 210 and a DEC tensor model calculation and invocation module 220, and performs the word segmentation and DEC tensorization tasks for a text. The 3-level word segmentation module 210 is a general-purpose internal segmentation module that progressively increases segmentation granularity in the order of 2-3-character words, 4-character words, and words of 5 or more characters, i.e., a semantic-understanding transfer from general words to specialized words, from the general to the specific. The DEC tensor model calculation and invocation module 220 is composed of a DEC processing module 221, a classification-model calculation module 222, and a classification-model invocation module 223. The DEC processing module 221 performs the tensorization of the text: it orders the words out of their original disorder and reconstructs the actual meaning of the text in the order specified by DEC, so that each sample occupies a certain region of the DEC three-dimensional space. The DEC processing module 221 performs tensor operations on the DEC as a whole, rather than splitting it into 3 single-dimensional vectors to be computed and then recombined. When constructing the DEC tensor model, the DEC processing module 221 first constructs the single-dimensional vectors of D, C, and E; then, according to the values of C, realizes the CE expansion by selecting between the 0 vector and the E vector; and finally selects between the 0 vector and the CE vector according to the values of D, thereby constructing the DEC tensor of the whole large text. During model calculation, the DEC processing module 221 performs DEC tensorization on the corpus; during model invocation, it performs DEC tensorization on the input text.
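The 3-level segmentation idea — growing granularity from short general words to longer domain terms — might be sketched as successive greedy merge passes over an initial segmentation; the dictionaries and tokens below are hypothetical English stand-ins for the Chinese 2-3 / 4 / 5+ character levels:

```python
# Hypothetical multi-level term dictionaries: the first level is the
# initial short-word segmentation itself; later levels merge adjacent
# tokens into longer, more specialized domain terms.
level_terms = [
    {"natural", "gas", "basin"},          # 2-3-character analogue (base tokens)
    {"natural gas"},                      # 4-character analogue
    {"natural gas accumulation"},         # 5+-character analogue
]

def segment(tokens):
    """Greedily merge adjacent tokens into known longer terms,
    applying the levels in order of increasing granularity."""
    for terms in level_terms[1:]:
        merged, i = [], 0
        while i < len(tokens):
            candidate = f"{tokens[i]} {tokens[i + 1]}" if i + 1 < len(tokens) else None
            if candidate in terms:
                merged.append(candidate)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

print(segment(["natural", "gas", "accumulation", "history"]))
```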
The classification model calculation module 222 is used only when computing the model: it integrates the total space occupied by all sample documents of a class and excludes the space occupied by the other classes, thereby constructing the unique feature model of the classification system. When computing the model, the module first collects all documents belonging to a class; the DEC tensors of the class's documents are superposed in a loop to obtain the class's total DEC tensor model, and performing the same operation on every class yields the positive-example space of each class. Treating the other classes as counterexample spaces and accumulating them gives each class's final counterexample space; subtracting the counterexample DEC tensor from the class DEC tensor then yields the class's DEC feature tensor model, i.e., the class features have been computed. The classification model calling module 223 is used only at application time: by calling the computed classification tensor model, it calculates the position distribution of a newly added document in the DEC space and so obtains the output classification result. The module performs a logical AND (logical multiplication) of the input document's DEC tensor with each class's DEC feature tensor to obtain a DEC value per class; finally, the tensor value of each class is summed over the three DEC dimensions (dimensionality reduction), giving the total number of occurrences of the document in each class, i.e., the strength with which the document belongs to that class.
The application layer 3 is composed of a text receiving module 310 and a classification result display module 320. The text receiving module 310 includes an editable input text box and a receive button; the classification result display module 320 consists of a displayed label module. The information received by the text receiving module 310 is transmitted through the front end to the server, enters the DEC semantic processing layer 2 of the server where classification is performed, and the classification result is then returned to the front-end classification result display module 320 for display.
Referring to fig. 3, the specific modeling method for content-based large text semantic tensorized classification is as follows:
step S1: handling the class → text correspondence;
step S110: under a storage directory, converting all texts of different formats into txt files, performing 3-level field word segmentation and cleaning to obtain a meaningful, clean document word set, and storing it in a specified new directory;
step S120: reading the text → class correspondence table;
step S130: converting the text → class correspondence table into a class → text-set correspondence, thereby handling the multi-classification problem: the same document may appear in different classes;
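The class → text-set inversion of step S130 can be sketched as a small dictionary transformation; the file names and class names below are invented placeholders, not from the patent.

```python
# Sketch of step S130: invert a text -> class table into a class -> text-set
# table, so a document tagged with several classes appears under each of them.
from collections import defaultdict

def invert_text_to_class(text_to_classes):
    """Turn {document: [classes]} into {class: [documents]}."""
    class_to_texts = defaultdict(list)
    for doc, classes in text_to_classes.items():
        for cls in classes:
            class_to_texts[cls].append(doc)  # multi-classification: one doc, many classes
    return dict(class_to_texts)

# Hypothetical example documents and classes:
table = {
    "report_001.txt": ["geology", "drilling"],
    "report_002.txt": ["geology"],
}
print(invert_text_to_class(table))
# {'geology': ['report_001.txt', 'report_002.txt'], 'drilling': ['report_001.txt']}
```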
step S2: constructing a document's DEC tensor;
step S210: reading a word segmentation set of each document under each class;
step S220: reading the D table and the C table, and after removing elements of the D table and the C table from the word set of the document, taking the remaining word set as an E set;
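The D/C/E split of step S220 amounts to partitioning the document's word set against the two tables; a minimal sketch, with invented placeholder words:

```python
# Sketch of step S220: remove D-table and C-table entries from the document's
# word set; the remaining words form the E set. Example words are hypothetical.
def split_dce(doc_words, d_table, c_table):
    d_set = [w for w in doc_words if w in d_table]
    c_set = [w for w in doc_words if w in c_table]
    # E = everything not claimed by the D table or the C table
    e_set = [w for w in doc_words if w not in d_table and w not in c_table]
    return d_set, c_set, e_set

doc_words = ["basin", "drilling", "porosity", "sandstone", "porosity"]
d_table = {"basin"}       # domain entity words
c_table = {"drilling"}    # business activity verbs
print(split_dce(doc_words, d_table, c_table))
# (['basin'], ['drilling'], ['porosity', 'sandstone', 'porosity'])
```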
step S230: expanding the independent one-dimensional E vector into a CE tensor; step S231: the D and C word sets of the document are sorted according to the D table and the C table to construct separate one-dimensional D and C vectors, and the E set is sorted by its frequency in the whole E table to obtain the independent one-dimensional E vector;
step S232: constructing a 0 vector of the same size as the E vector, i.e. every element is 0; combining the 0 vector and the E vector row-wise into a 0 → E vector pair; repeating the vector pair n times row-wise to obtain the candidate set of the CE tensor, where n is the dimensionality of the C vector;
step S233: first obtaining the selection ordinal vector of the 0 → E pairs in the CE candidate set according to the size of the C vector; adding this ordinal vector to the C vector gives the set of ordinals, i.e. for each C value, the ordinal of the 0 vector or the E vector to be selected from the candidate set; performing the selection completes the expansion of the E vector into the CE tensor;
step S240: implementing the expansion from the CE tensor to the DEC tensor;
step S241: flattening the CE tensor into a first-order vector, constructing a 0 vector with the same dimensionality, and constructing a 0 → CE vector pair;
step S242: expanding the 0 → CE vector pair n times row-wise to form the 0 → CE candidate set, where n is the dimensionality of the D vector;
step S243: first determining the ordinal vector of the 0 → CE pairs according to the dimensionality of the D vector; adding this vector to the D vector gives the ordinals of the DEC vector; performing the selection yields the final DEC tensor of the large text;
looping from step S243 back to step S210 completes the construction of the positive-example DEC tensor of the whole class;
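Steps S231-S243 can be sketched compactly in NumPy, under the assumption that the D and C vectors are 0/1 indicator vectors (1 where the table word occurs in the document) whose values select the E block or the 0 block at each level; this encoding is an interpretation of the patent's "selecting the 0 vector or the E vector", not its exact implementation.

```python
# Sketch of steps S231-S243: expand E -> CE -> DEC by selection from 0/value
# candidate pairs. Assumes d_vec and c_vec are 0/1 indicator vectors.
import numpy as np

def build_dec_tensor(d_vec, c_vec, e_vec):
    d_vec = np.asarray(d_vec)   # shape (nd,), values in {0, 1}
    c_vec = np.asarray(c_vec)   # shape (nc,), values in {0, 1}
    e_vec = np.asarray(e_vec)   # shape (ne,), e.g. word frequencies
    # S232-S233: candidate pair [0-vector, E-vector]; each C value picks a row.
    pair = np.stack([np.zeros_like(e_vec), e_vec])    # shape (2, ne)
    ce = pair[c_vec]                                  # shape (nc, ne)
    # S241-S243: flatten CE, pair it with a 0 vector, let D values pick blocks.
    flat = ce.reshape(-1)
    pair2 = np.stack([np.zeros_like(flat), flat])     # shape (2, nc*ne)
    dec = pair2[d_vec].reshape(len(d_vec), len(c_vec), len(e_vec))
    return dec

dec = build_dec_tensor(d_vec=[1, 0], c_vec=[0, 1, 1], e_vec=[3, 1, 2])
print(dec.shape)   # (2, 3, 3)
print(dec[0])      # the CE block is kept where D = 1; dec[1] is all zeros
```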
step S3: completing the calculation of class characteristics;
step S310: selecting any one class, and adding the tensors of all the remaining classes to obtain the counterexample tensor of that class;
step S320: subtracting the counterexample tensor of the class from the positive-example tensor of the class to obtain the feature tensor of the class;
step S330: converting the class feature tensor into a json dictionary format suitable for calling, and outputting it as the input of the model calling module.
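Step S3 reduces to tensor sums and one subtraction per class; the sketch below assumes each class's positive-example tensor has already been accumulated, and uses invented class names and a nested-list json serialization:

```python
# Sketch of steps S310-S330: counterexample tensor = sum of the other classes'
# tensors; feature tensor = positive tensor minus counterexample tensor.
# The class names and serialization format are illustrative assumptions.
import json
import numpy as np

def class_feature_tensors(class_tensors):
    """class_tensors: {class_name: summed positive-example DEC tensor}."""
    total = sum(class_tensors.values())
    features = {}
    for name, positive in class_tensors.items():
        negative = total - positive           # all remaining classes combined
        features[name] = positive - negative  # subtract the counterexample space
    return features

tensors = {
    "geology": np.ones((2, 2, 2)),
    "drilling": np.full((2, 2, 2), 3.0),
}
features = class_feature_tensors(tensors)
# S330: export in a json dictionary format suitable for the calling module.
payload = json.dumps({k: v.tolist() for k, v in features.items()})
print(features["geology"][0, 0, 0])   # 1 - 3 = -2.0
```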
Referring to fig. 4, the calling flow of the content-based large text semantic tensor model is described as follows:
step S210: reading in a text, in txt format, from the front end;
step S220: performing 3-level word segmentation on the text and cleaning it to obtain a clean text word set;
step S230: reading the D table and the C table, and calculating an E set;
step S240: constructing the CE tensor in the same manner as step S230 (steps S231-S233) of fig. 3;
step S250: constructing the DEC tensor of the input text in the same manner as step S240 (steps S241-S243) of fig. 3;
step S260: loading a DEC tensor model of the class;
step S270: logically multiplying the text DEC tensor and the class tensor;
step S280: performing dimensionality-reduction addition over the 3 DEC dimensions on the class tensors obtained in step S270, to obtain the strength with which the document belongs to each class;
step S290: and sorting the output results of the classes according to the display requirements.
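The calling flow of steps S260-S290 can be sketched as follows, reading the patent's "logical multiplication" as an elementwise AND of the two tensors' nonzero patterns weighted by the feature values; this reading is an assumption, not necessarily the exact operation intended.

```python
# Sketch of steps S260-S290: logical product of the document's DEC tensor with
# each class feature tensor, dimensionality-reduction sum, then sorting.
import numpy as np

def classify(doc_dec, class_features):
    scores = {}
    for name, feat in class_features.items():
        # S270: elementwise AND of the nonzero patterns, weighted by the features.
        overlap = np.logical_and(doc_dec != 0, feat != 0) * feat
        # S280: reduce over all 3 DEC dimensions -> membership strength.
        scores[name] = float(overlap.sum())
    # S290: sort classes by strength for display.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical document tensor and class feature tensors:
doc = np.zeros((2, 2, 2)); doc[0, 0, 0] = 1.0
feats = {
    "geology": np.ones((2, 2, 2)),
    "drilling": np.zeros((2, 2, 2)),
}
print(classify(doc, feats))   # [('geology', 1.0), ('drilling', 0.0)]
```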
The number of modules and the processing scale described herein are intended to simplify the description of the invention. Applications, modifications and variations of the content-based large text high-order semantic tensorial classification method and system of the present invention will be apparent to those skilled in the art.
While embodiments of the invention have been described above, the invention is not limited to the applications set forth in the description and the embodiments; it is fully applicable in the various fields to which it pertains, and further modifications may readily be made by those skilled in the art. The invention is therefore not limited to the details shown and described herein, provided there is no departure from the general concept defined by the appended claims and their equivalents.

Claims (10)

1. A content-based high-order semantic tensorial classification method for large texts, characterized by comprising the following steps:
step one, constructing a DEC tensor model of a class:
1.1) performing 3-level field word segmentation on a plurality of large texts, and establishing the correspondence between each large text's word set and its class;
2.1) constructing the DEC tensor of each large text: segmenting each large text under each class to obtain a word set D representing the Domain of the large text, a word set C representing the business activity Concepts, and a word set E representing the related Elements;
2.2) firstly obtaining an independent one-dimensional vector of E, and then expanding the independent one-dimensional vector of E into a CE tensor;
2.3) expanding from the CE tensor to the DEC tensor;
2.4) looping from step 2.3) back to step 2.1) to complete the tensor construction of the whole class;
3.1) selecting any one class, and adding the tensors of the remaining classes to obtain the counterexample tensor of that class;
3.2) subtracting the counterexample tensor of the class from the whole-class positive-example tensor to obtain the feature tensor of the class;
step two, obtaining part of the text of a large text to be classified; first constructing the DEC tensor of the large text to be classified from that part of the text according to steps 2.1) to 2.3), then loading the class feature tensors of step one, logically multiplying the DEC tensor of the large text to be classified with the feature tensor of each class, then performing dimensionality-reduction addition over the 3 DEC dimensions on each logically multiplied class tensor to obtain the strength with which the large text to be classified belongs to each class, and finally displaying the output result.
2. The content-based large text high-order semantic tensorial classification method according to claim 1, wherein in step one, when constructing the DEC tensor model of the class, the method further comprises:
firstly, selecting a plurality of words representing the Domain in a field as a domain word set and establishing a D table, and selecting a plurality of words representing business activity Concepts as a business activity verb set and establishing a C table;
in step 2.1), when each large text is segmented, the D table and the C table are read first, the elements of the D table and the C table are removed from the large text's word segmentation set, and the remaining words are combined into the word set E, from which the E table is established.
3. The method as claimed in claim 2, wherein in step 2.2), the specific method of first obtaining the independent one-dimensional E vector and then expanding it into the CE tensor comprises:
firstly, sorting the D and C word sets of the large text according to the D table and the C table to construct separate one-dimensional D and C vectors, and sorting the E word set by its frequency in the E table to obtain the independent one-dimensional E vector;
then, constructing a 0 vector of the same size as the E vector, i.e. every element is 0, and combining the 0 vector and the E vector row-wise into a 0 → E vector pair; repeating the vector pair n times row-wise to obtain the candidate set of the CE tensor, where n is the dimensionality of the C vector;
and finally, obtaining the selection ordinal vector of the 0 → E pairs in the CE candidate set according to the size of the C vector, adding the ordinal vector to the C vector to obtain the set of ordinals, i.e. for each C value the ordinal of the 0 vector or the E vector to be selected from the candidate set, and performing the selection, completing the expansion from the E vector to the CE tensor.
4. The content-based large text high-order semantic tensorial classification method as recited in claim 2, wherein in step 2.3), the specific method for expanding from the CE tensor to the DEC tensor comprises:
flattening the CE tensor into a first-order vector, constructing a 0 vector with the same dimensionality, and constructing a 0 → CE vector pair;
expanding the 0 → CE vector pair n times row-wise to form the 0 → CE candidate set, where n is the dimensionality of the D vector;
determining the ordinal vector of the 0 → CE pairs according to the dimensionality of the D vector, and adding this vector to the D vector to obtain the ordinals of the DEC vector;
and performing the selection to obtain the DEC tensor of the large text.
5. The content-based large text high-order semantic tensorial classification method according to claim 1, wherein in step 1.1), each large text, whatever its original format, is first converted into a txt file; and each large text's word set may correspond to multiple classes.
6. The content-based large text high-order semantic tensorial classification method according to claim 1, further comprising, after step 3.2) and before step two:
3.3) converting the feature tensor of the class into a json dictionary format suitable for calling, and outputting it as the input loaded in step two.
7. A content-based large text high-order semantic tensorial classification system, characterized by comprising:
a basic language layer, used for storing the elements of the DEC tensor model and the corpus samples required for processing the DEC tensor model, the corpus samples comprising word segmentation results of large texts, large text names, and the correspondence between large texts and classes;
a DEC semantic processing layer, communicatively connected with the basic language layer, used for completing word segmentation and DEC tensorization of large texts and realizing the calculation and calling of the tensor model;
and an application layer, communicatively connected with the DEC semantic processing layer, used for receiving text input by the user and displaying the classification result.
8. The system of claim 7, wherein the DEC semantic processing layer comprises:
a 3-level word segmentation module, which increases the word segmentation granularity step by step in the order of 2-3-character, 4-character, and 5-or-more-character segments;
and a DEC tensor model calculation and calling module, comprising a DEC processing module, a classification model calculation module and a classification model calling module, wherein the DEC processing module is communicatively connected with the basic language layer and is used for completing the tensorization of each large text, performing the tensor operation on DEC as a whole and, by ordering, reconstructing the text in the order specified by DEC; the classification model calculation module is used for constructing the unique feature tensor model of each class; and the classification model calling module is communicatively connected with the application layer and is used, at application time, for calculating the position distribution of a newly added text in the DEC space by calling the computed classification tensor model, so as to obtain the output classification result.
9. The system according to claim 8, wherein the basic language layer comprises:
the model element module is connected with the DEC processing module and comprises a D table, a C table and a classification structure tree, wherein the D table comprises a plurality of words representing field entity objects, the words are arranged according to the sequence of the importance degrees, the C table comprises a plurality of words representing service activities, the words are arranged according to the sequence of the importance degrees of the service activities, and the classification structure tree is a service system knowledge structure for describing service logic;
and the corpus classifying module is used for storing the corpus samples required by the processing of the DEC tensor model.
10. The system according to claim 8, wherein the application layer comprises:
a text receiving module, comprising an editable input text box and a receive button, communicatively connected with the DEC processing module;
and a classification result display module, comprising a label module for display, used for displaying the result.
CN201710687437.0A 2017-08-11 2017-08-11 Large text high-order semantic tensorial classification method and system based on content Active CN107506415B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710687437.0A CN107506415B (en) 2017-08-11 2017-08-11 Large text high-order semantic tensorial classification method and system based on content

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710687437.0A CN107506415B (en) 2017-08-11 2017-08-11 Large text high-order semantic tensorial classification method and system based on content

Publications (2)

Publication Number Publication Date
CN107506415A CN107506415A (en) 2017-12-22
CN107506415B true CN107506415B (en) 2020-07-21

Family

ID=60690788

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710687437.0A Active CN107506415B (en) 2017-08-11 2017-08-11 Large text high-order semantic tensorial classification method and system based on content

Country Status (1)

Country Link
CN (1) CN107506415B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105913085A (en) * 2016-04-12 2016-08-31 中国科学院深圳先进技术研究院 Tensor model-based multi-source data classification optimizing method and system
CN106294568A (en) * 2016-07-27 2017-01-04 北京明朝万达科技股份有限公司 A kind of Chinese Text Categorization rule generating method based on BP network and system
CN106599933A (en) * 2016-12-26 2017-04-26 哈尔滨工业大学 Text emotion classification method based on the joint deep learning model
CN106815310A (en) * 2016-12-20 2017-06-09 华南师范大学 A kind of hierarchy clustering method and system to magnanimity document sets
CN106897437A (en) * 2017-02-28 2017-06-27 北明智通(北京)科技有限公司 The many sorting techniques of high-order rule and its system of a kind of knowledge system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8060512B2 (en) * 2009-06-05 2011-11-15 Xerox Corporation Hybrid tensor-based cluster analysis


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Chinese Text Classification with Domain Ontology Graphs Based on Concept Clustering; Ye Shiren et al.; Computer Engineering; 2016-12-15; Vol. 42, No. 12; pp. 181-187 *

Also Published As

Publication number Publication date
CN107506415A (en) 2017-12-22

Similar Documents

Publication Publication Date Title
Vysotska et al. Web Content Support Method in Electronic Business Systems.
CN111539197B (en) Text matching method and device, computer system and readable storage medium
CN107291840B (en) User attribute prediction model construction method and device
KR101035038B1 (en) System and method for automatic generation of classifier for large data using of dynamic combination of classifier
CN109844742A (en) The analysis method, analysis program and analysis system of graph theory is utilized
CN112883190A (en) Text classification method and device, electronic equipment and storage medium
CN110659367B (en) Text classification number determination method and device and electronic equipment
CN106897437B (en) High-order rule multi-classification method and system of knowledge system
Ince et al. AHP-TOPSIS method for learning object metadata evaluation
CN116501898A (en) Financial text event extraction method and device suitable for few samples and biased data
CN114398557A (en) Information recommendation method and device based on double portraits, electronic equipment and storage medium
Guswandi et al. Sistem Pendukung Keputusan Pemilihan Calon Wali Nagari Menggunakan Metode TOPSIS
CN107506415B (en) Large text high-order semantic tensorial classification method and system based on content
Chen et al. Using latent Dirichlet allocation to improve text classification performance of support vector machine
Mandivarapu et al. Efficient document image classification using region-based graph neural network
CN107357909B (en) Efficient multi-backpack container winding system and winding method thereof
CN103902374B (en) Cellular automation and empowerment directed hypergraph based cloud-computing task scheduling method
CN110083654A (en) A kind of multi-source data fusion method and system towards science and techniques of defence field
CN115033699A (en) Fund user classification method and device
CN112989054B (en) Text processing method and device
Gorbushin et al. Automated intellectual analysis of consumers' opinions in the scope of internet marketing and management of the international activity in educational institution
CN113987126A (en) Retrieval method and device based on knowledge graph
CN113869024A (en) Method and system for generating initial guarantee scheme of airplane
CN113535946A (en) Text identification method, device and equipment based on deep learning and storage medium
CN112365189A (en) Case distribution method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: No. 601, floor 6, building 19, building 219, Huizhong Beili, Chaoyang District, Beijing 100012

Applicant after: Beijing Zhitong Yunlian Technology Co., Ltd

Address before: 100041, No. 7, building 2, No. 30, 49 Hing Street, Beijing, Shijingshan District

Applicant before: BEIMING SMARTECH (BEIJING) Co.,Ltd.

GR01 Patent grant