CN109582965B - Distributed platform construction method and system of semantic analysis engine

Info

Publication number: CN109582965B
Application number: CN201811456181.3A
Authority: CN
Inventors: 高岚
Original assignee: Sichuan Changhong Electric Co Ltd
Current assignee: Sichuan Changhong Electric Co Ltd
Priority date: 2018-11-30
Filing date: 2018-11-30
Publication date: 2022-03-01
Anticipated expiration: 2038-11-30
Also published as: CN109582965A

Abstract

The invention relates to the technical field of big data processing, and in particular to a distributed platform architecture method and system for a semantic analysis engine. The method copes with ever-growing data processing volumes and reduces maintenance and update costs. The distributed platform architecture method of a semantic analysis engine comprises the following steps: receiving input user statement data; performing offline training on the user statement data; and analyzing the user statement data in real time to obtain a semantic result. The real-time parsing system parses user sentences in real time, which includes segmenting each sentence and extracting its intention and vocabulary features in order to understand its real meaning. The offline system trains the segmentation models and the intention extraction models needed by the real-time system.

Description

Distributed platform construction method and system of semantic analysis engine

Technical Field

The invention relates to the technical field of big data processing, and in particular to a distributed platform architecture method and system for a semantic analysis engine.

Background

The rapid development of artificial intelligence (AI) technology is making everyday devices such as televisions, mobile phones, and audio equipment increasingly intelligent. Voice interaction is an important capability, and the semantic analysis technology within voice interaction, which helps a machine understand human language, is a key part of it. Every intelligent device with a voice interaction function therefore requires semantic analysis technology. When a product with a voice interaction function is to be deployed in a production environment, an important consideration is estimating the volume of data the product must analyze. When the data volume is small, an offline standalone semantic analysis engine can be deployed directly on the device terminal. When the data volume is large, a big data platform must be chosen for processing in order to ensure a good user experience.

A standalone platform is inconvenient to maintain and update; this problem can be solved by collecting users' voice data centrally for analysis, processing, and feedback. However, as the volume of user data increases dramatically, a big data platform must be used to handle it. Distributed big data processing architectures are now applied more and more widely in various fields because they can process huge amounts of data at greatly increased speed. It is therefore necessary to apply a distributed big data processing architecture to voice interaction technology as well.

Disclosure of Invention

The invention aims to provide a distributed platform architecture method and system for a semantic analysis engine which, by using a distributed big data processing architecture, can handle larger data processing volumes and reduce maintenance and update costs to a certain extent.

In a first aspect, the invention discloses a distributed platform architecture method of a semantic analysis engine, which comprises the following steps:

receiving input user statement data;

performing off-line training on the user statement data; and

analyzing the user statement data in real time to obtain a semantic result.

Preferably, the process of performing offline training on the user statement data includes:

storing the input user statement data on a distributed system to generate training data, converting the training data into a distributed data set so that the training data is partitioned, training the partitioned training data according to a word segmentation format to obtain a CRF word segmentation model, and training according to the part of speech of each word to obtain a CRF part-of-speech model;

performing word segmentation on all the training data, calculating a d-dimensional vector for each word of the segmented training data by an unsupervised method to obtain word vectors, and further generating a word vector model;

building a bidirectional encoder-decoder model based on a neural network, inputting the word vectors into the bidirectional encoder-decoder model for training so that it learns the intention of input sentences, simultaneously segmenting all the training data, converting the segmented user data into a resilient distributed dataset, and inputting it into the bidirectional encoder-decoder model to verify whether the intention is accurate, so as to train an intention extraction model; and

providing, by querying a standard dictionary, a near-sense word and/or a label for each word of the word vectors to generate a labeled near-sense word network.

Preferably, the process of analyzing the user statement data in real time includes:

calling a pre-trained CRF word segmentation model to segment the user statement data into a plurality of words, and calling a pre-trained CRF part-of-speech model to label the part of speech of each word obtained by the segmentation;

looking up all the part-of-speech-tagged words in the pre-trained labeled near-sense word network, and finding all the labels related to each word by combining the word with its tagged part of speech;

meanwhile, calling a pre-trained intention extraction model to analyze all the part-of-speech-tagged words and obtain the possible intention information of the current user sentence; and

analyzing the intention information together with the label word information to obtain a final semantic result.

The real-time parsing system parses user sentences in real time, which includes segmenting each sentence and extracting its intention and vocabulary features in order to understand its real meaning. The offline system trains the segmentation models and the intention extraction models needed by the real-time system. The method copes with ever-growing data processing volumes and reduces maintenance and update costs.

In a second aspect, the invention discloses a distributed platform architecture system of a semantic analysis engine, comprising:

the offline training system is configured to receive input user statement data and perform offline training on the user statement data; and

the real-time analysis system is configured to receive input user statement data and analyze the user statement data in real time to obtain a semantic result.

Preferably, the offline training system is configured to:

Preferably, the real-time parsing system is configured to invoke a pre-trained CRF word segmentation model to segment the user statement data into a plurality of words, and then invoke a pre-trained CRF part-of-speech model to label the part of speech of each word obtained by the segmentation;

The invention has the beneficial effects that:

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic diagram of a distributed system of semantic analysis engines according to an embodiment of the invention.

Detailed Description

In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced without some of these specific details. The following description of the embodiments is merely intended to provide a better understanding of the present invention by illustrating examples thereof.

The technical solutions of the embodiments of the present invention will be described below with reference to the accompanying drawings.

In a first aspect of the present disclosure, a distributed platform architecture method for a semantic analysis engine is provided. As shown in FIG. 1, the processing flow connected by solid lines in the lower part is the real-time analysis framework of the semantic analysis engine, and the processing flow connected by dotted lines in the upper part is the offline model-training data processing framework of the semantic analysis engine. The method comprises the following steps: receiving input user statement data; performing offline training on the user statement data; and analyzing the user statement data in real time to obtain a semantic result.

Wherein the process of performing offline training on the user sentence data comprises:

Training uses CRF-Spark with the Spark-based LBFGS algorithm. The input user sentence data is stored on HDFS, i.e. a distributed system, to generate training data, and the training data is converted into a distributed data set so that it is partitioned, e.g. converted into an RDD using the textFile function, and processed in parallel on a cluster. The partitioned training data is trained according to a word segmentation format to obtain a CRF word segmentation model. The format of the training data can be customized; the word segmentation format is (word, B/I/E/S), where B/I/E/S marks the beginning (B), middle (I), and end (E) of a word, or a single-character word (S), for example:

human being	B
People	I
Net	E
1	B
Moon cake	I
1	I
Day(s)	E
Information communication	S
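As an illustration of the (word, B/I/E/S) format, the following minimal Python sketch derives the tags from already segmented words. It is not the CRF-Spark training code; the function names `bies_tags` and `to_training_lines` and the placeholder words are assumptions made for this example only.

```python
def bies_tags(units):
    """Tag the units (e.g. characters) of one segmented word with B/I/E/S."""
    units = list(units)
    if not units:
        return []
    if len(units) == 1:
        # A word consisting of a single unit is tagged S.
        return [(units[0], "S")]
    # First unit is B, last unit is E, everything in between is I.
    tags = ["B"] + ["I"] * (len(units) - 2) + ["E"]
    return list(zip(units, tags))


def to_training_lines(segmented_sentence):
    """Flatten segmented words into tab-separated 'unit<TAB>tag' lines."""
    return [
        f"{unit}\t{tag}"
        for word in segmented_sentence
        for unit, tag in bies_tags(word)
    ]


if __name__ == "__main__":
    # Placeholder words of length 3, 4 and 1 (hypothetical data), giving the
    # same tag pattern as the table above: B I E, B I I E, S.
    for line in to_training_lines(["abc", "wxyz", "q"]):
        print(line)
```

Each output line has the same two-column layout as the table above, so a file produced this way could in principle serve as segmentation training input.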

During training, the train function can be used to train on the data; after training, the save function is called to store the model at a fixed location. The partitioned training data is also trained according to the part of speech of each word to obtain a CRF part-of-speech model. The part-of-speech format is (word, part of speech), where the parts of speech include adjective (adj), noun (n), verb (v), etc., as follows:

Word	Part of speech
Advancing direction	v
Is filled with	v
Hope for	n
Is/are as follows	u
New	a
Century	n
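A minimal sketch of how training data in this (word, part of speech) layout might be read into tagged sentences before being handed to a trainer. The function name `parse_pos_lines` and the convention that a blank line separates sentences are assumptions of this example, not details given in the source.

```python
def parse_pos_lines(lines):
    """Parse 'word<TAB>pos' lines into a list of tagged sentences.

    Assumption: a blank line marks the end of a sentence.
    """
    sentences, current = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            # Blank line: close the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        # Split on the last tab so that words containing spaces stay intact.
        word, pos = line.rsplit("\t", 1)
        current.append((word, pos))
    if current:
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    sample = ["Hope for\tn", "New\ta", "Century\tn", "", "Is filled with\tv"]
    for sentence in parse_pos_lines(sample):
        print(sentence)
```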

All the training data is segmented into words, and a d-dimensional vector is calculated for each word of the segmented training data by an unsupervised method to obtain word vectors, from which a word vector model is generated. A bidirectional LSTM encoder-decoder model is built on a Spark-based LSTM neural network, and the word vectors are input into it for training so that it learns the intention of an input sentence. At the same time, all the training data is segmented and the segmented user data is converted into a resilient distributed dataset, for example into RDD data using the textFile function. The RDD data is wrapped in DataSet form to construct the final training data form RDD<DataSet>, which is input into the bidirectional encoder-decoder model and trained with the train function to check whether the intention is accurate, so as to train an intention extraction model.

A near-sense word and/or a label is provided for each word of the word vectors by querying a standard dictionary, generating a labeled near-sense word network, for example:

Vocabulary	Label	Near-sense word
Play back	intent:play	Play back
Watch with	intent:play	Play back
Check the	intent:search	Searching
Searching	intent:search	Searching
To come	intent:recommend	Recommending
Recommending	intent:recommend	Recommending
Downloading	intent:download	Downloading
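The table above can be read as a lookup structure from a word to its label and near-sense word. The sketch below holds it in a plain Python dictionary; the names `NEAR_SENSE_NETWORK`, `lookup` and `labels_for` are hypothetical, and the source does not state how the network is actually stored or queried.

```python
# Rows of the table above: word -> (label, near-sense word).
NEAR_SENSE_NETWORK = {
    "Play back": ("intent:play", "Play back"),
    "Watch with": ("intent:play", "Play back"),
    "Check the": ("intent:search", "Searching"),
    "Searching": ("intent:search", "Searching"),
    "To come": ("intent:recommend", "Recommending"),
    "Recommending": ("intent:recommend", "Recommending"),
    "Downloading": ("intent:download", "Downloading"),
}


def lookup(word):
    """Return (label, near-sense word) for a word, or None if it is unknown."""
    return NEAR_SENSE_NETWORK.get(word)


def labels_for(words):
    """Collect the labels of all words that appear in the network."""
    found = {}
    for word in words:
        entry = lookup(word)
        if entry is not None:
            found[word] = entry[0]
    return found


if __name__ == "__main__":
    print(lookup("Watch with"))
    print(labels_for(["Check the", "something else"]))
```

Different surface words ("Play back", "Watch with") map to the same label and near-sense word, which is what lets later steps treat them as the same intent.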

The process of analyzing the user statement data in real time is as follows:

For example, a terminal sends a user statement via POST, such as "how much the weather is today".

The user statement is analyzed by a service based on the Spring Boot framework.

A pre-trained CRF word segmentation model is called to segment the user sentence data into a plurality of words, such as:

"today / weather / how".

A pre-trained CRF part-of-speech model is then called to label the part of speech of each word obtained by the segmentation, such as:

"today: t (time word), weather: n (noun), how: ry (interrogative)".

All the part-of-speech-tagged words are looked up in the pre-trained labeled near-sense word network; each word, combined with its tagged part of speech, is used to find all the labels related to it, such as:

"today: day-0, weather: weather".

Meanwhile, a pre-trained intention extraction model is called to analyze all the part-of-speech-tagged words and obtain the possible intention information of the current user sentence, such as:

"intention: query, field: weather".

Finally, the intention information and the label word information are analyzed together to obtain a final semantic result, such as:

text how much the weather is today
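The real-time steps above can be sketched as a small pipeline. Everything in this Python example is a stand-in: the lookup tables replace the pre-trained CRF segmentation model, CRF part-of-speech model, labeled near-sense word network and intention extraction model, and the layout of the returned dictionary is hypothetical, since the source does not show the full final result. The values themselves are taken from the worked example above.

```python
# Hypothetical stand-ins for the pre-trained models (assumptions for this
# sketch; a real system would load trained CRF and intention models).
SEGMENTATION = {"how much the weather is today": ["today", "weather", "how"]}
POS_TAGS = {"today": "t", "weather": "n", "how": "ry"}
LABEL_NETWORK = {("today", "t"): "day-0", ("weather", "n"): "weather"}


def segment(text):
    """Step 1: split the sentence into words (whitespace fallback)."""
    return SEGMENTATION.get(text, text.split())


def tag_pos(words):
    """Step 2: attach a part of speech to each word ('x' if unknown)."""
    return [(word, POS_TAGS.get(word, "x")) for word in words]


def find_labels(tagged):
    """Step 3: look up labels by (word, part of speech)."""
    return {
        word: LABEL_NETWORK[(word, pos)]
        for word, pos in tagged
        if (word, pos) in LABEL_NETWORK
    }


def extract_intent(tagged):
    """Step 4: toy rule standing in for the intention extraction model."""
    words = {word for word, _ in tagged}
    if "weather" in words and "how" in words:
        return {"intention": "query", "field": "weather"}
    return {"intention": "unknown", "field": "unknown"}


def parse(text):
    """Step 5: combine intention and label information into one result."""
    tagged = tag_pos(segment(text))
    result = {"text": text, "words": tagged, "labels": find_labels(tagged)}
    result.update(extract_intent(tagged))
    return result


if __name__ == "__main__":
    print(parse("how much the weather is today"))
```

Label lookup and intention extraction both depend only on the tagged words, which matches the description of the two steps running "meanwhile" before their outputs are combined.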

Preferably, the offline training system is configured to:

The detailed working process has already been described in the corresponding method and is not repeated here.

Although the present invention has been described herein with reference to the illustrated embodiments thereof, which are intended to be preferred embodiments of the present invention, it is to be understood that the invention is not limited thereto, and that numerous other modifications and embodiments can be devised by those skilled in the art that will fall within the spirit and scope of the principles of this disclosure.

Claims

1. A distributed platform architecture method of a semantic analysis engine is characterized by comprising the following steps:

receiving input user statement data;

performing off-line training on the user statement data; and

analyzing the user statement data in real time to obtain a semantic result;

building a bidirectional encoder-decoder model based on a neural network, inputting the word vectors into the bidirectional encoder-decoder model for training so that it learns the intention of input sentences, simultaneously segmenting all the training data, converting the segmented user data into a resilient distributed dataset, and inputting it into the bidirectional encoder-decoder model to check whether the intention is accurate, so as to train an intention extraction model; and

providing, by querying a standard dictionary, a near-sense word and/or a label for each word of the word vectors to generate a labeled near-sense word network;

the process of analyzing the user statement data in real time comprises the following steps:

2. A distributed platform architecture system for a semantic analysis engine, comprising:

the real-time analysis system is configured to receive input user statement data and analyze the user statement data in real time to obtain a semantic result;

the offline training system is configured to:

the real-time analysis system is configured to call a pre-trained CRF word segmentation model to segment the user statement data into a plurality of words, then call a pre-trained CRF part-of-speech model to label the part of speech of each word obtained by the segmentation;