CN116663534A

CN116663534A - Text data statistical analysis system and method based on natural language processing

Info

Publication number: CN116663534A
Application number: CN202310961991.9A
Authority: CN
Inventors: 孙兆洋; 隋媛; 张莺; 张荫芬; 李文武
Original assignee: China National Institute of Standardization
Current assignee: China National Institute of Standardization
Priority date: 2023-08-02
Filing date: 2023-08-02
Publication date: 2023-08-29

Abstract

The invention provides a text data statistical analysis system and method based on natural language processing. The system comprises a natural language data screening system, a natural language data management system, a natural language data analysis system and a natural language chart visualization management system. The natural language data screening system comprises a natural language data preprocessing module, a natural language data screening module and an information acquisition module. A directed graph, with tables and keywords as nodes and ordered pairs as edges, is automatically generated according to the statistical result and the user screening conditions of the natural language chart visualization management system, and a semantic matching module uses natural language processing technology to judge whether a natural language data query result matches the question the user wants analyzed. The invention can establish a knowledge base using natural language processing and knowledge graph technology, and then realize statistical analysis of text data and mining of text document data using intelligent data analysis and visualization technology.

Description

Text data statistical analysis system and method based on natural language processing

Technical Field

The invention relates to the technical field of natural language analysis, in particular to a text data statistical analysis system and method based on natural language processing.

Background

With the continuous development of artificial intelligence technology, natural language semantic parsing and interaction technology is receiving increasing attention. Current dialogue systems each rely on their own corpus aimed at a particular industry; they either cannot perform intelligent management and statistical calculation on data, or can do so only with relatively fixed templates.

In large data centers, statistics are compiled by people reading the text, which consumes a great deal of time. With the explosive growth in the number of text documents, manual work can no longer meet the demand for text data analysis, and the scattered storage of enterprise text documents is likely to leave a large amount of important data unmined and therefore lost, wasting data resources.

Therefore, the problem of statistically analyzing enterprise text document data needs to be solved, so that key information can be extracted to guide the production and operation of the enterprise.

Disclosure of Invention

Aiming at the problems, the invention provides a text data statistical analysis system and a text data statistical analysis method based on natural language processing.

In order to solve the problems, the invention adopts the following technical scheme:

a text data statistical analysis system based on natural language processing comprises a natural language data screening system, a natural language data management system, a natural language data analysis system and a natural language chart visualization management system;

the natural language data screening system comprises a natural language data preprocessing module, a natural language data screening module and an information acquisition module;

the natural language data management system comprises a natural language data construction module, a natural language data configuration module and a description semantic coding module;

the natural language data analysis system comprises a natural language understanding module, a natural language query module, a semantic matching module and a label semantic coding module;

the natural language chart visualization management system is used for generating and visually displaying data charts; it provides a data chart generation template, reads the statistical result according to the chart template, and automatically generates, according to the statistical result and the user screening conditions, a directed graph with tables and keywords as nodes and ordered pairs as edges;

the semantic matching module uses natural language processing technology to judge whether a natural language data query result matches the question the user wants analyzed, and matched data are included in the statistics.
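
The matching decision can be illustrated with a minimal sketch, assuming that the user's question and each query result have already been turned into feature vectors. The cosine measure, the default `threshold` of 0.8 and the function names are assumptions chosen for illustration; the disclosure does not state how the match is computed.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector matches nothing
    return dot / (norm_a * norm_b)


def select_matches(question_vec, results, threshold=0.8):
    """Return the ids of query results judged to match the question.

    `results` is a list of (result_id, vector) pairs; `threshold` is a
    hypothetical cut-off, not a value taken from the disclosure.
    """
    return [
        result_id
        for result_id, vec in results
        if cosine_similarity(question_vec, vec) >= threshold
    ]
```

Only the results returned by `select_matches` would then be passed on to the statistical calculation.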

Preferably, the natural language data preprocessing module is used for preprocessing document texts, including corpus importing, format conversion and corpus cleaning.
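
A minimal sketch of such a preprocessing pass is given below. The concrete steps (UTF-8 decoding and NFKC normalization as format conversion; removal of control characters, whitespace collapsing and de-duplication as cleaning) and the name `clean_corpus` are assumptions for illustration, since the disclosure only names the three stages.

```python
import re
import unicodedata


def clean_corpus(raw_documents):
    """Import, convert and clean a list of raw documents (bytes or str)."""
    cleaned = []
    seen = set()
    for raw in raw_documents:
        # Format conversion: decode bytes and normalize to NFKC text.
        text = raw.decode("utf-8", errors="replace") if isinstance(raw, bytes) else raw
        text = unicodedata.normalize("NFKC", text)
        # Corpus cleaning: drop control characters other than newline and tab.
        text = "".join(
            ch for ch in text
            if unicodedata.category(ch) != "Cc" or ch in "\n\t"
        )
        # Collapse runs of whitespace and trim the ends.
        text = re.sub(r"\s+", " ", text).strip()
        # Skip empty documents and exact duplicates.
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned
```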

Preferably, the natural language data screening module extracts document information from the document by using natural language processing technology and provides knowledge data for the natural language data management system, and the information acquisition module is used for acquiring the text description of the data that a visitor needs to access and call and for defining the identity tag information of the visitor.

Preferably, the natural language data construction module is used for defining the field and the label of the natural language.

Preferably, the natural language data configuration module is used for configuring natural language data and establishing a mapping relation between the data and the graph labels, thereby providing a data source for the subsequent natural language data analysis system. The natural language data management system provides a visualization function for adding, deleting and modifying natural language data. The description semantic coding module processes the text description of the data that the visitor needs to access and call, and then obtains a resource description semantic feature vector through a semantic encoder comprising an embedding layer.
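
To make the idea of a semantic encoder with an embedding layer concrete, the toy sketch below looks up one vector per token and averages them into a single feature vector. The class name `ToySemanticEncoder`, the whitespace tokenization, the hand-written embedding table and the mean pooling are all assumptions for illustration; a real encoder would presumably use trained embeddings and further layers, which the disclosure does not detail.

```python
class ToySemanticEncoder:
    """Embedding lookup followed by mean pooling (illustrative only)."""

    def __init__(self, embedding_table, dim):
        # embedding_table maps a token to a list of `dim` floats.
        self.embedding_table = embedding_table
        self.dim = dim

    def encode(self, text):
        # Hypothetical tokenization: lower-case and split on whitespace.
        tokens = text.lower().split()
        vectors = [
            self.embedding_table[token]
            for token in tokens
            if token in self.embedding_table
        ]
        if not vectors:
            return [0.0] * self.dim  # no known token in the text
        # Mean pooling over the token vectors, one dimension at a time.
        return [
            sum(vec[i] for vec in vectors) / len(vectors)
            for i in range(self.dim)
        ]
```

The same encoder object could be applied both to a resource description and to a visitor's identity tag, giving two vectors in the same space.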

Preferably, the natural language understanding module is connected with a user interaction interface and provides a user question description template; the user inputs the question to be analyzed in the user interaction interface according to the template, and the semantics of the question are extracted by a natural language processing technology based on the template and deep learning.

Preferably, the natural language query module performs queries and data statistics on the knowledge graph data by using a graph algorithm.

Preferably, the natural language query module includes a natural language algorithm, which refers to a basic graph algorithm of the search class; natural language data are queried according to this algorithm, and the query result is provided to the semantic matching module for judgment. The tag semantic coding module performs word segmentation on the identity tag information of the visitor and then obtains an identity tag semantic feature vector through the semantic encoder comprising an embedding layer.
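
As an example of a basic search-class graph algorithm, the sketch below runs a depth-limited breadth-first search over a knowledge graph stored as an adjacency dictionary. Breadth-first search, the `max_depth` parameter and the node names are assumptions for illustration; the disclosure does not say which search algorithm is used.

```python
from collections import deque


def bfs_query(graph, start, max_depth=2):
    """Return the nodes reachable from `start` within `max_depth` hops.

    `graph` maps a node to the list of nodes it points to. Nodes are
    returned in breadth-first order, starting with `start` itself.
    """
    visited = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        order.append(node)
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, depth + 1))
    return order
```

The returned nodes would be the candidate query results handed to the semantic matching module.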

Preferably, the natural language query module further comprises a natural language calculation function, which comprises basic statistical calculations such as summation and difference, and the statistical calculation result is called by the natural language chart visualization management system.
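
A minimal sketch of such a calculation function is shown below. The function name `summarize` and the choice to report differences between consecutive values are assumptions for illustration.

```python
def summarize(values):
    """Return the sum of `values` and the differences between neighbors."""
    total = sum(values)
    # Difference of each value from the one before it.
    differences = [later - earlier for earlier, later in zip(values, values[1:])]
    return {"sum": total, "differences": differences}
```

For instance, `summarize([10, 15, 12])` returns a sum of 37 and the differences 5 and -3, which a chart template could plot directly.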

A method of a text data statistical analysis system based on natural language processing, comprising the steps of:

S1, constructing a natural language data screening system, which comprises a natural language data preprocessing module, a natural language data screening module and an information acquisition module; constructing a natural language data management system, which comprises a natural language data construction module and a natural language data configuration module; constructing a natural language data analysis system, which comprises a natural language understanding module, a natural language query module and a semantic matching module; and constructing a natural language chart visualization management system;

S2, completing the natural language data management system by defining the fields and tag data of the natural language, acquiring the text description of the data that the visitor needs to access and call, and defining the identity tag information of the visitor;

S3, uploading natural language data to the natural language data preprocessing module, and then performing corpus importing, format conversion and corpus cleaning on the natural language data;

S4, labeling the natural language data, and after labeling automatically extracting the data and importing it into a knowledge graph, thereby providing a data source for the subsequent natural language data analysis system; the natural language data management system provides a visualization function for adding, deleting and modifying the natural language data;

S5, inputting the question to be analyzed into the question description template of the natural language understanding module, and extracting its semantics by a natural language processing technology based on the template and deep learning;

S6, generating a directed graph, specifically as follows: the directed graph is $AS = \langle M, N \rangle$, where the vertex set contains $k_i$, the $i$-th user-demand keyword or entity, $i = 1, \dots, n$, and $t_r$, the $r$-th data information table, $r = 1, \dots, m$; the edge set is defined as $CS = \{\langle k_i, t_r \rangle \mid i = 1, 2, 3, \dots, n;\ r = 1, \dots, m;\ t_r \text{ is a data information table associated with the } i\text{-th user-demand keyword or entity}\}$; when a keyword corresponds to several fields of the same data information table, the field with the maximum similarity is selected.
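
The graph construction in the last step can be sketched as follows, with user-demand keywords and data information tables as vertices and one edge from a keyword to each table it is associated with. Treating "associated" as "best field similarity at or above `min_similarity`", the default of 0.5, and the use of `difflib.SequenceMatcher` as the similarity measure are assumptions for illustration; the disclosure does not define the similarity function.

```python
from difflib import SequenceMatcher


def string_similarity(a, b):
    """Stand-in similarity in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, a, b).ratio()


def build_directed_graph(keywords, tables, similarity=string_similarity,
                         min_similarity=0.5):
    """Build the vertex list and the keyword -> table edge mapping.

    `tables` maps a table name to its list of field names. Each edge
    (keyword, table) stores the single field with the highest similarity,
    so a keyword that resembles several fields of one table yields one edge.
    """
    vertices = list(keywords) + list(tables)
    edges = {}
    for keyword in keywords:
        for table, fields in tables.items():
            if not fields:
                continue
            # Keep only the field with the maximum similarity.
            best_field = max(fields, key=lambda field: similarity(keyword, field))
            if similarity(keyword, best_field) >= min_similarity:
                edges[(keyword, table)] = best_field
    return vertices, edges
```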

The beneficial effects of the invention are as follows:

1. A knowledge base is established using natural language processing and knowledge graph technology, and statistical analysis of text data and mining of text document data are realized using intelligent data analysis and visualization technology. Unified data management and association analysis of text documents of the same type are realized using natural language processing, knowledge graph and graph algorithm technology; the knowledge graph can be expanded and updated, and the data analysis results are updated accordingly.

2. The text description of the data that the visitor needs to access and call is acquired, and a semantic understanding model for natural language processing performs adaptive semantic understanding on this text description and on the identity tag information of the visitor respectively. The text document data analysis result is generated automatically and displayed visually as a chart, which enhances the readability of the data analysis result.

3. Self-learning transfer is added for the fields of different industries, so that the professional vocabulary of each industry can be learned, making the query results more industry-specific and more practical.

Drawings

FIG. 1 is a system block diagram of the present invention;

FIG. 2 is a flow chart of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of protection of the present invention.

Referring to FIG. 1, a text data statistical analysis system based on natural language processing includes a natural language data screening system, a natural language data management system, a natural language data analysis system and a natural language chart visualization management system.

The natural language data analysis system comprises a natural language understanding module, a natural language query module, a semantic matching module and a tag semantic coding module.

Further, the natural language data preprocessing module is used for preprocessing document texts, including corpus importing, format conversion and corpus cleaning.

Further, the natural language data screening module extracts document information by using natural language processing technology and provides knowledge data for the natural language data management system, and the information acquisition module is used for acquiring the text description of the data that a visitor needs to access and call and for defining the identity tag information of the visitor.

Further, the natural language data construction module is used for defining the field and the label of the natural language.

Further, the natural language data configuration module is used for configuring natural language data and establishing a mapping relation between the data and the graph labels, thereby providing a data source for the subsequent natural language data analysis system. The natural language data management system provides a visualization function for adding, deleting and modifying the natural language data. The description semantic coding module processes the text description of the data that the visitor needs to access and call, and obtains a resource description semantic feature vector through a semantic encoder comprising an embedding layer.

Further, the natural language understanding module is connected with the user interaction interface and provides a user question description template; the user inputs the question to be analyzed in the user interaction interface according to the template, and the semantics of the question are extracted by a natural language processing technology based on the template and deep learning.

Further, the natural language query module performs queries and data statistics on the knowledge graph data by using a graph algorithm.

Further, the natural language query module comprises a natural language algorithm technology, which refers to a basic graph algorithm of the search class; natural language data are queried according to this technology, and the query result is provided to the semantic matching module for judgment. The tag semantic coding module performs word segmentation on the identity tag information of the visitor and then obtains an identity tag semantic feature vector through a semantic encoder comprising an embedding layer.

Furthermore, the natural language query module also comprises a natural language calculation function, which comprises basic statistical calculations such as summation and difference, and the statistical calculation result is called by the natural language chart visualization management system.

Referring to FIG. 2, a method of a text data statistical analysis system based on natural language processing includes the steps of:

In summary, the invention establishes a knowledge base using natural language processing and knowledge graph technology, and then realizes statistical analysis of text data and mining of text document data using intelligent data analysis and visualization technology. Unified data management and association analysis of text documents of the same type are realized using natural language processing, knowledge graph and graph algorithm technology; the knowledge graph can be expanded and updated, and the data analysis results are updated accordingly. By acquiring the text description of the data that the visitor needs to access and call, a semantic understanding model for natural language processing performs adaptive semantic understanding on this text description and on the identity tag information of the visitor respectively. The text document data analysis result is displayed visually, which enhances its readability. Self-learning transfer is added for the fields of different industries, so that the specialized vocabulary of each industry can be learned, making the query results more industry-specific and more practical.

The formula in the invention is a dimensionless formula that operates on numerical values; it was obtained by collecting a large amount of data and performing software simulation so as to approximate the actual situation as closely as possible. The preset proportionality coefficients in the formula are set by a person skilled in the art according to the actual situation or are obtained by simulation on a large amount of data.

Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims

1. The text data statistical analysis system based on natural language processing is characterized by comprising a natural language data screening system, a natural language data management system, a natural language data analysis system and a natural language chart visualization management system;

2. The system of claim 1, wherein the natural language data preprocessing module is used for preprocessing document text, including corpus importing, format conversion and corpus cleaning.

3. The system of claim 2, wherein the natural language data screening module extracts document information from the document by using natural language processing technology and provides knowledge data for the natural language data management system, and the information acquisition module is used for acquiring the text description of the data that the visitor needs to access and call and for defining the identity tag information of the visitor.

4. A text data statistical analysis system based on natural language processing according to claim 3, wherein the natural language data construction module is used for defining the field and the label of the natural language.

5. The system of claim 4, wherein the natural language data configuration module is configured to configure natural language data and establish a mapping relationship between the data and the graph labels, to provide a data source for the subsequent natural language data analysis system; the natural language data management system provides a visualization function for adding, deleting and modifying the natural language data; and the description semantic coding module processes the text description of the data that the visitor needs to access and call, and then obtains a resource description semantic feature vector through a semantic encoder including an embedding layer.

6. The system of claim 5, wherein the natural language understanding module is connected to a user interaction interface to provide a user question description template, and a user can input a question to be analyzed according to the template in the user interaction interface, and the question to be analyzed by the user is semantically extracted by a natural language processing technology based on the template and deep learning.

7. The system of claim 6, wherein the natural language query module performs queries and data statistics on knowledge graph data using a graph algorithm.

8. The text data statistical analysis system based on natural language processing according to claim 7, wherein the natural language query module comprises a natural language algorithm technology, which refers to a basic graph algorithm of the search class; natural language data are queried according to the natural language algorithm technology, and the query result is provided to the semantic matching module for judgment; and the tag semantic coding module performs word segmentation on the identity tag information of the visitor and then obtains an identity tag semantic feature vector through the semantic encoder comprising an embedding layer.

9. The system of claim 8, wherein the natural language query module further comprises a natural language calculation function, the natural language calculation function comprises basic statistical calculations such as summation and difference, and the statistical calculation result is called by the natural language chart visualization management system.

10. A method for a natural language processing based text data statistical analysis system as claimed in any one of claims 1 to 9, comprising the steps of:

S6, generating a directed graph, specifically as follows: the directed graph is $AS = \langle M, N \rangle$, where the vertex set contains $k_i$, the $i$-th user-demand keyword or entity, $i = 1, \dots, n$, and $t_r$, the $r$-th data information table, $r = 1, \dots, m$; the edge set is defined as $CS = \{\langle k_i, t_r \rangle \mid i = 1, 2, 3, \dots, n;\ r = 1, \dots, m;\ t_r \text{ is a data information table associated with the } i\text{-th user-demand keyword or entity}\}$; when a keyword corresponds to several fields of the same data information table, the field with the maximum similarity is selected.