CN116431806A - Natural language understanding method and refrigerator - Google Patents


Info

Publication number
CN116431806A
Authority
CN
China
Prior art keywords
text
data
natural language
information
labeling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310292017.8A
Other languages
Chinese (zh)
Inventor
曾谁飞
孔令磊
李华刚
张景瑞
李敏
刘卫强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Haier Refrigerator Co Ltd
Qingdao Haier Smart Technology R&D Co Ltd
Haier Smart Home Co Ltd
Original Assignee
Qingdao Haier Refrigerator Co Ltd
Qingdao Haier Smart Technology R&D Co Ltd
Haier Smart Home Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao Haier Refrigerator Co Ltd, Qingdao Haier Smart Technology R&D Co Ltd, Haier Smart Home Co Ltd
Priority to CN202310292017.8A
Publication of CN116431806A
Legal status: Pending

Classifications

    • F MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
    • F25 REFRIGERATION OR COOLING; COMBINED HEATING AND REFRIGERATION SYSTEMS; HEAT PUMP SYSTEMS; MANUFACTURE OR STORAGE OF ICE; LIQUEFACTION OR SOLIDIFICATION OF GASES
    • F25D REFRIGERATORS; COLD ROOMS; ICE-BOXES; COOLING OR FREEZING APPARATUS NOT OTHERWISE PROVIDED FOR
    • F25D29/00 Arrangement or mounting of control or safety devices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/08 Learning methods
    • G06N3/096 Transfer learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Thermal Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Combustion & Propulsion (AREA)
  • Mechanical Engineering (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a natural language understanding method and a refrigerator. The method comprises the following steps: acquiring a labeled text corresponding to text data; performing slot extraction on the labeled text, and matching the extracted slot information against a preset slot database to obtain a plurality of matching results corresponding to the text data; and identifying and fusing the plurality of matching results to obtain understanding result information corresponding to the text data. By performing slot extraction on the labeled text and matching the extracted slot information against the preset slot database, the method obtains matching results quickly and efficiently, which improves the timeliness of natural language understanding, the accuracy of the understanding result information, and the user experience.

Description

Natural language understanding method and refrigerator
Technical Field
The invention relates to the technical field of computers, in particular to a natural language understanding method and a refrigerator.
Background
Natural language understanding (NLU) is a branch of natural language processing (NLP) with wide application, and can in particular be applied to smart home appliances such as refrigerators. For example, a smart refrigerator and its users can realize human-computer interaction scenarios such as intelligent retrieval, intelligent recommendation, and intelligent question answering through multimodal data including text, voice, and images. These interaction scenarios are closely tied to natural language understanding methods, which have become a key part of applied artificial-intelligence technology.
Existing natural language understanding methods include rule-based and statistics-based approaches, which infer the intent of natural language by summarizing rules from labeled data, or which build a model from a single type of data label through traditional machine learning to realize natural language understanding.
Such methods have problems on two levels. First, when multimodal data is acquired, deep information such as semantics, context, and grammar is not mined, so the understanding result is not necessarily accurate. Second, when the semantic result information corresponding to the text data is matched, the matching is slow and inefficient. These problems tend to make the language understanding inaccurate, incomplete, and untimely, so that during interaction with a user the smart refrigerator responds slowly, returns feedback of low accuracy, and offers poor convenience and a poor user experience.
Disclosure of Invention
The invention aims to provide a natural language understanding method that solves the technical problems of the prior art, in which a model is built in isolation from a single labeled text, matching of semantic results is slow, the expressed content is monotonous, and the original meaning of the text data cannot be captured accurately and efficiently.
An object of the present invention is to provide a refrigerator.
In order to achieve one of the above objects, the present invention provides a natural language understanding method, comprising: acquiring a labeled text corresponding to text data; performing slot extraction on the labeled text, and matching the extracted slot information against a preset slot database to obtain a plurality of matching results corresponding to the text data; and identifying and fusing the plurality of matching results to obtain understanding result information corresponding to the text data; wherein the matching results characterize features of a plurality of elements of the text data in terms of word attributes and/or word senses, and the understanding result information characterizes the text category, emotional connotation, and/or underlying intention of the text data as a whole.
As a further improvement of an embodiment of the present invention, the text data includes voice-derived text data, video-derived text data, and plain text data.
As a further improvement of an embodiment of the present invention, before the "obtaining the labeled text corresponding to the text data", the method further includes: acquiring multi-source heterogeneous data and extracting data features of the multi-source heterogeneous data, wherein the multi-source heterogeneous data includes at least one of voice data and video data; inputting the data features into a convolutional neural network model to obtain corresponding first text information; and aligning the sequence lengths of the first text information and the data features using a connectionist temporal classification (CTC) algorithm, then performing a fully connected combination calculation on the first text information to obtain the text data.
As a further improvement of an embodiment of the present invention, the "inputting the data feature into the convolutional neural network model to obtain the corresponding first text information" specifically includes: and inputting the data features into a multi-size multi-channel convolutional neural network model or a distillation diffusion model to obtain first text information conforming to the data features.
As a further improvement of an embodiment of the present invention, before the "extracting the data feature of the multi-source heterogeneous data", the method further includes: judging whether the multi-source heterogeneous data comprises voice data or not; the step of inputting the data features into a convolutional neural network model to obtain corresponding first text information specifically includes: if yes, inputting the voice characteristics into the multi-size multi-channel convolutional neural network model to obtain first text information corresponding to the voice data; wherein the data features comprise the speech features; and/or, prior to the "extracting the data features of the multi-source heterogeneous data", the method further comprises: judging whether the multi-source heterogeneous data comprises video data or not; the step of inputting the data features into a convolutional neural network model to obtain corresponding first text information specifically includes: if yes, inputting the video features into the distillation diffusion model to obtain first text information corresponding to the video data; wherein the data features comprise the video features.
As a further improvement of an embodiment of the present invention, the method further includes: and inputting the video data into a 3D depth convolution neural network model to obtain the video features.
As a further improvement of an embodiment of the present invention, the "obtaining the labeling text corresponding to the text data" specifically includes: performing pre-labeling on the text data based on a preset labeling rule to obtain first labeled text information, wherein the preset labeling rule characterizes labeling information corresponding to a plurality of professional field words and/or words with special meanings in the text data; executing text feature annotation on the first annotation text information to obtain the annotation text, wherein the text feature annotation comprises at least one of entity annotation, relation annotation, intention annotation, grammar annotation and emotion annotation; and executing annotation quality inspection on the annotation text, screening and storing the annotation text meeting the preset condition.
As a further improvement of an embodiment of the present invention, the "performing a labeling quality check on the labeling text, screening and storing the labeling text meeting a preset condition" specifically includes: counting and judging whether the accuracy of the text labels is greater than or equal to a preset threshold, wherein the accuracy characterizes the ratio of the correct label number to the total label number in the label text; if yes, the marked text is saved to a text marked database; and if not, re-executing the pre-labeling, the text feature labeling and the labeling quality inspection on the text data.
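As a rough illustration (not the patent's implementation), the quality-inspection branch above can be sketched in Python; the function names, data shapes, and the 0.95 threshold are assumptions:

```python
# Hypothetical sketch of the annotation quality check: annotations whose
# accuracy (correct labels / total labels) meets a preset threshold are
# stored in the annotation database; otherwise the text is sent back for
# pre-labeling, feature labeling, and re-inspection.

def check_annotation_quality(annotations, threshold=0.95):
    """annotations: list of (label, is_correct) pairs for one labeled text."""
    total = len(annotations)
    correct = sum(1 for _, ok in annotations if ok)
    accuracy = correct / total if total else 0.0
    return accuracy >= threshold

annotation_database = []  # stands in for the text annotation database

def store_or_relabel(text, annotations, relabel):
    if check_annotation_quality(annotations):
        annotation_database.append((text, annotations))
        return "stored"
    # failed the check: re-run the labeling pipeline on the text
    return relabel(text)
```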
As a further improvement of an embodiment of the present invention, the "performing slot extraction on the labeled text, and matching the extracted slot information against a preset slot database to obtain a plurality of matching results corresponding to the text data" specifically includes: extracting each slot according to the slot annotation information in the labeled text to obtain a plurality of pieces of slot information; querying the preset slot database with the plurality of pieces of slot information, and judging whether each piece of slot information matches at least one query result; if yes, obtaining a matching result corresponding to each piece of slot information; if not, constructing a natural language understanding model from the labeled text, performing slot extraction on the labeled text based on the natural language understanding model, and matching the extracted slot information to obtain a plurality of matching results corresponding to the text data.
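The extract-then-query flow of this claim can be sketched as follows. This is an illustrative toy, not the patent's implementation: the database contents, tag names, and the model fallback callable are all made up.

```python
# Slot extraction from slot annotations, lookup in a preset slot
# database, and a model-based fallback when any slot fails to match.

PRESET_SLOT_DB = {  # hypothetical contents of the preset slot database
    "food": {"apple": "FOOD:apple", "milk": "FOOD:milk"},
    "temperature": {"2 degrees": "TEMP:2C"},
}

def extract_slots(labeled_text):
    """labeled_text: list of (token, slot_tag) pairs; keep tagged tokens."""
    return [(tag, tok) for tok, tag in labeled_text if tag != "O"]

def match_slots(slots, fallback_model):
    results = []
    for tag, value in slots:
        hit = PRESET_SLOT_DB.get(tag, {}).get(value)
        if hit is None:
            # no query result: fall back to the NLU model's prediction
            hit = fallback_model(tag, value)
        results.append(hit)
    return results
```

For example, `extract_slots([("put", "O"), ("milk", "food")])` yields one slot, and `match_slots` resolves it through the database before ever touching the model.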
As a further improvement of an embodiment of the present invention, before the step of "building a natural language understanding model from the markup text", the method further includes: and carrying out data set division on the marked text by adopting a k-fold cross validation method to obtain a training data set and a test data set.
As a further improvement of an embodiment of the present invention, the "constructing a natural language understanding model according to the markup text, and performing text feature prediction on the markup text based on the natural language model, to obtain a plurality of matching results corresponding to the text information" specifically includes: preprocessing the training data set, and extracting text features of the preprocessed training data to obtain initial features of each labeling text in the training data set; executing pooling operation and full-connection combined calculation on the initial characteristics to obtain text characteristics corresponding to the labeling text; inputting the text features into a pre-training language model to perform model training to obtain the natural language understanding model; and inputting the test data set into the natural language understanding model, and predicting a plurality of matching results corresponding to each labeling text in the test data.
As a further improvement of an embodiment of the present invention, the "obtaining the text feature corresponding to the labeling text by performing pooling operation and full-connection combined calculation on the initial feature" specifically includes: converting the local area data of the initial feature into a plurality of first matrixes; performing splicing operation on the first matrixes according to rows to obtain a second matrix; traversing the second matrix and sequentially obtaining the maximum value of each row of elements in the matrix to obtain a pooling feature matrix; and inputting the pooled feature matrix to a full-connection layer, calculating to obtain the score of the text feature information of the marked text, and screening to obtain the text feature corresponding to the marked text according to the score.
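The pooling and fully connected computation of this claim, which stacks the flattened local-region matrices by rows, takes a row-wise maximum, and scores the pooled vector with a fully connected layer, could be sketched as below. Shapes and weights are illustrative assumptions:

```python
import numpy as np

# Each local region of the initial feature becomes a "first matrix",
# flattened to one row; the rows are spliced into a second matrix;
# the maximum of each row gives the pooling feature vector; a fully
# connected layer then produces the score.

def pool_and_score(local_regions, w, b):
    first = [np.asarray(r).reshape(1, -1) for r in local_regions]
    second = np.vstack(first)        # splice the first matrices by rows
    pooled = second.max(axis=1)      # max of each row of elements
    return pooled @ w + b            # fully connected scoring

rng = np.random.default_rng(0)
regions = [rng.standard_normal((2, 3)) for _ in range(4)]
score = pool_and_score(regions, rng.standard_normal(4), b=0.1)
```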
As a further improvement of an embodiment of the present invention, the method for constructing the natural language understanding model includes: machine learning methods, deep learning methods, knowledge-graph methods, fusion and integration methods, and large language model methods.
As a further improvement of an embodiment of the present invention, the "identifying and fusing the plurality of matching results to obtain the understanding result information corresponding to the text data" specifically includes: based on a dictionary or a preset mapping file, semantic text information corresponding to each matching result is obtained, and splicing is carried out on the semantic text information, so that understanding result information corresponding to the text data is obtained.
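A minimal sketch of this fusion step, assuming a hypothetical mapping table in place of the dictionary or preset mapping file:

```python
# Each matching result is mapped to semantic text via a dictionary
# (standing in for the preset mapping file), then the pieces are
# spliced into the final understanding result information.

SEMANTIC_MAP = {  # illustrative mapping, not from the patent
    "INTENT:query_stock": "the user asks what is stored",
    "FOOD:milk": "the user mentions milk",
}

def fuse(matching_results, sep="; "):
    pieces = [SEMANTIC_MAP.get(r, r) for r in matching_results]
    return sep.join(pieces)
```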
As a further improvement of an embodiment of the present invention, the method further includes: and converting the understanding result information corresponding to the text data into voice for output, and/or converting the understanding result information corresponding to the text data into voice for transmission to a client for output, and/or converting the understanding result information corresponding to the text data into text for transmission to the client.
In order to achieve one of the above objects, the present invention also provides a refrigerator, comprising: a memory for storing executable instructions; and a processor which, when executing the executable instructions stored in the memory, implements the steps of any of the natural language understanding methods described above.
Compared with the prior art, the embodiment of the invention has at least one of the following beneficial effects:
By performing slot extraction on the labeled text and matching the extracted slot information against the preset slot database, the natural language understanding method of the invention obtains matching results quickly and efficiently, improving the timeliness of natural language understanding. Meanwhile, by fusing the plurality of matching results, the output understanding result information can capture both the accuracy of entity-level content and the relationships between entities, which strengthens natural language understanding capability, improves the accuracy of language understanding, and improves the user experience.
Drawings
FIG. 1 is a schematic diagram of steps of a natural language understanding method according to an embodiment of the present invention.
Fig. 2 (a) is a schematic diagram of steps for determining and converting voice data into first text information according to an embodiment of the present invention.
Fig. 2 (b) is a schematic diagram of steps for determining video data and converting the video data into first text information according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of steps for obtaining a labeling text corresponding to text data according to an embodiment of the invention.
FIG. 4 is a schematic diagram of steps for screening and storing annotation text meeting annotation quality in an embodiment of the invention.
FIG. 5 is a schematic diagram illustrating steps for performing slot extraction and slot information matching on a markup text according to an embodiment of the present invention.
FIG. 6 is a schematic diagram of steps for constructing a language understanding model in an embodiment of the present invention.
FIG. 7 is a schematic block diagram of a preferred embodiment of a natural language understanding method in an embodiment of the present invention.
FIG. 8 is a data transformation diagram of a preferred embodiment of a natural language understanding method in accordance with an embodiment of the present invention.
Detailed Description
The present invention will be described in detail below with reference to specific embodiments shown in the drawings. These embodiments are not intended to limit the invention and structural, methodological, or functional modifications of these embodiments that may be made by one of ordinary skill in the art are included within the scope of the invention.
It should be noted that the terms "comprises," "comprising," and any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Furthermore, the terms "first," "second," "third," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Natural language understanding expresses the hope that machines can understand language as humans do. However, natural language exhibits diversity, ambiguity, robustness, knowledge dependence, and other characteristics in both its grammatical and semantic structure, so machines deviate in their understanding during human-computer interaction, resulting in poor interaction and low understanding accuracy. Natural language understanding therefore has important practical significance in the field of natural language processing.
Based on this, the present invention provides a natural language understanding method, as shown in fig. 1, specifically including the following steps:
step S1, obtaining a labeling text corresponding to the text data.
And S2, performing slot extraction on the marked text, and performing matching on a plurality of pieces of slot information obtained by extraction based on a preset slot database to obtain a plurality of matching results corresponding to the text data.
And step S3, identifying and fusing the plurality of matching results to obtain understanding result information corresponding to the text data.
The matching results characterize features of a plurality of elements of the text data in terms of word attributes and/or word senses, and the understanding result information characterizes the text category, emotional connotation, and/or underlying intention of the text data as a whole.
In this way, matching the plurality of extracted pieces of slot information against a preset slot database obtains matching results quickly and efficiently, enables data sharing, and reduces redundancy in the matching results, which helps ensure the safety and reliability of the data. In addition, fusing the plurality of matching results allows the output understanding result information to capture both the accuracy of entity-level content and the relationships between entities, strengthening natural language understanding capability, improving the accuracy of the understanding result information, and improving the user experience.
It should be emphasized that the preset slot database may be dynamically updated: when the matching of a piece of slot information fails, that piece may be added to the preset slot database, enriching the database and helping to improve the success rate of subsequent slot matching.
For step S1, the text data includes voice-derived text data, video-derived text data, and plain text data.
In this way, multi-source heterogeneous data helps in two respects: on the one hand, multiple sources ensure diverse data origins, making the acquired text data more complete, comprehensive, and reliable; on the other hand, heterogeneous information of different kinds can be connected together despite differences in data structure, enriching the text content of the text data in terms of structure and type.
It should be noted that the multi-source heterogeneous data may consist of multi-source data (i.e., data from diverse sources), heterogeneous data (i.e., data with different storage structure types), or data that is both multi-source and heterogeneous; the present invention does not specifically limit this.
The voice-derived text data refers to text content transcribed from real-time or offline voice data; the video-derived text data refers to text content transcribed from real-time or offline video data; the plain text data may include real-time and offline text data, and may specifically refer to text related to the user's preferences for, interest in, or comments on food materials.
It will be appreciated that the text data may also include historical voice, video, and text data, in order to later enrich the training data set and facilitate accurate construction of the natural language understanding model; the invention is not specifically limited in this regard.
Prior to step S1, the method may further comprise: acquiring multi-source heterogeneous data and extracting its data features; inputting the data features into a convolutional neural network model to obtain corresponding first text information; and aligning the sequence lengths of the first text information and the data features using a connectionist temporal classification (CTC) algorithm, then performing a fully connected combination calculation on the first text information to obtain the text data.
Wherein the multi-source heterogeneous data comprises voice data, video data, or both voice data and video data.
In this way, the convolutional neural network model transcribes the data features of the multi-source heterogeneous data into corresponding text information, reducing the ambiguity that arises when speech that sounds the same carries different semantics. In addition, speaking or shooting video is faster than typing, so converting voice or video into text can improve working efficiency.
In one embodiment, the voice data is a query or an instruction sentence that a user currently speaks to the intelligent electronic device, or to a terminal device communicatively connected to it. Taking the smart refrigerator as an example, a user may speak a query such as "Are there any vegetables in the refrigerator?", or may issue an instruction such as "Please set the refrigerating chamber to 2 degrees Celsius".
In another embodiment, the video data consists essentially of sets of consecutive images. Specifically, a continuous image sequence is acquired using a video acquisition device such as a mobile phone, a camera, or another acquisition device.
In a preferred embodiment, both may be included, i.e., the acquired multi-source heterogeneous data includes both voice data and video data, which can be separately processed and text-annotated.
For step S1, in one embodiment, user voice data may be collected by a voice acquisition device arranged in the smart refrigerator, such as a pickup or a microphone array; user video data may be collected by a video acquisition device such as a camera arranged in the refrigerator; and text data may be obtained via applets, official accounts, or the Web. Furthermore, the text data may include food-material information reflecting the user's preferences and interests, as well as user comments such as "I have always liked Kung Pao chicken", which cover the user's interests and hobbies and may even contain information associated with the current text data.
In addition, real-time and/or offline voice, video, and text data may be acquired from a client terminal connected via a wireless communication protocol. The client terminal is an electronic device with an information-sending function, such as a mobile phone, tablet computer, smart speaker, smart bracelet, or Bluetooth headset. In use, when the user needs to interact with the smart refrigerator, the user can speak or shoot video directly at the smart refrigerator, or the client terminal can collect the voice and transmit it to the smart refrigerator via Wi-Fi, Bluetooth, or another wireless communication mode. In other embodiments of the present invention, one or more of the above acquisition methods may be used, or the multi-source heterogeneous data may be acquired through other channels based on the prior art, which is not described in detail here.
Further, the data may be preprocessed before the step of extracting the data features of the multi-source heterogeneous data. Specifically, the speech is segmented by a specified length (a time period or a number of samples), completing framing and windowing of the speech. Pre-emphasis may be applied before framing to boost the high-frequency part of the speech, eliminating the influence of lip radiation during vocalization, compensating for the high-frequency components suppressed by the vocal system, and highlighting high-frequency formants. After windowing, steps such as audio noise filtering and speech enhancement may be performed to complete the enhancement of the voice data and extract the characteristic parameters of the speech, so that the voice data meets the input requirements of the subsequent neural network model.
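The pre-emphasis, framing, and windowing steps above could be sketched as follows. The coefficient 0.97, the 25 ms frame length, and the 10 ms hop are typical textbook values, not taken from the patent:

```python
import numpy as np

# Speech preprocessing sketch: pre-emphasis boosts high frequencies,
# framing cuts the signal into fixed-length overlapping windows, and a
# Hamming window tapers each frame before feature extraction.

def preemphasize(x, alpha=0.97):
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frame(x, frame_len=400, hop=160):   # 25 ms / 10 ms at 16 kHz
    n = 1 + max(0, (len(x) - frame_len) // hop)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n)[:, None]
    return x[idx]

def window(frames):
    return frames * np.hamming(frames.shape[1])

signal = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of 440 Hz
feats = window(frame(preemphasize(signal)))
```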
Further, the video data may be cropped using a script or a third-party video cropping tool. Specifically, the video data is loaded, the video information is read, the video is decoded according to that information, and the multi-frame images corresponding to the video data are obtained according to the width and height of a single frame.
Further, irrelevant data, duplicate data, outliers, and records with missing values are deleted from the text data set, and the text data is cleaned and formatted. Category labels are then attached to the text data based on rule-and-statistics methods, and word segmentation is performed using methods based on string matching, understanding, statistics, or rules. Stop words are then removed, completing the preprocessing of the text data so that it meets the input requirements of the subsequent neural network model.
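A toy sketch of this cleaning pipeline follows; a whitespace split stands in for a real Chinese word segmenter, and the stop-word list is illustrative:

```python
# Text preprocessing sketch: drop empty and duplicate records, segment
# into tokens, and remove stop words.

STOP_WORDS = {"the", "a", "to", "of"}   # illustrative English stand-in

def preprocess(texts):
    cleaned, seen = [], set()
    for t in texts:
        t = t.strip().lower()
        if not t or t in seen:          # drop empty / duplicate records
            continue
        seen.add(t)
        tokens = [w for w in t.split() if w not in STOP_WORDS]
        cleaned.append(tokens)
    return cleaned
```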
Further, in one embodiment, the python language may be used to write an audio/video separation script, or a third party audio/video separation tool may be used to perform a separation operation on data that includes both video data and audio data, so as to obtain effective voice data and video data.
The "inputting the data features into the convolutional neural network model to obtain the corresponding first text information" section may specifically include:
and inputting the data features into a multi-size multi-channel convolutional neural network model or a distillation diffusion model to obtain text content conforming to the data features.
Thus, the convolutional neural network model can comprise multiple types, and the corresponding neural network model can be selected for different data features, which improves adaptability to the data features and the generalized recognition capability of the algorithm. In one embodiment, the method can further be configured to adaptively select a suitable neural network model according to the type of the data feature, so as to improve the efficiency and accuracy of converting the data feature into text.
It should be emphasized that, for the data of different data types or different structure types, in order to improve the accuracy and adaptivity of feature extraction, the invention provides a refinement operation on the data types of the multi-source heterogeneous data so as to fully exert the advantages of two neural network models.
Specifically, in the first embodiment, as shown in fig. 2 (a), the method may include step S01A before the "extracting the data features of the multi-source heterogeneous data"; the "inputting the data features into the convolutional neural network model to obtain the corresponding first text information" may further specifically include step S02A corresponding to step S01A. The step S01A and the step S02A specifically include:
And step S01A, judging whether the multi-source heterogeneous data comprises voice data or not.
If yes, jumping to the step S02A, inputting the voice characteristics into the multi-size multi-channel convolutional neural network model to obtain first text information corresponding to the voice data; wherein the data features include voice features.
Therefore, the voice data is processed by utilizing the multi-size multi-channel convolution neural network model, the channel and the size in the neural network model can be adaptively selected according to the characteristics of the voice data, the adaptability to the voice-to-text can be improved, and the accuracy of the voice-to-text can be improved.
In a second embodiment, as shown in fig. 2 (B), the method may further include a step S01B before the "extracting the data features of the multi-source heterogeneous data"; the "inputting the data features into the convolutional neural network model to obtain the corresponding first text information" may further specifically include step S02B corresponding to step S01B. The step S01B and the step S02B specifically include:
and step S01B, judging whether the multi-source heterogeneous data comprises video data or not.
If yes, jumping to the step S02B, and inputting video features into the distillation diffusion model to obtain first text information corresponding to the video data; wherein the data characteristic comprises the video characteristic.
Therefore, the distillation diffusion model is used to process the video data; rapid sampling of the video data can be realized by changing the weight information in the model, reducing the time and cost of converting video into text. In addition, with the distillation diffusion model, images visually equivalent to those of the original model can be generated, a trade-off between sample diversity and quality can be made, and the timeliness and accuracy of the video-to-text conversion can be improved.
In a preferred embodiment, the two embodiments may be combined; that is, when the multi-source heterogeneous data includes both voice data and video data, the two kinds of data may be processed using different neural network models and converted into corresponding text information.
In practical applications, considering that the sentences recognized from image text are complex, for example in sentence length, pause position or word composition, and that their image features are correlated, the "video feature" portion in step S02B may specifically include:
and inputting the video data into a 3D depth convolution neural network model to obtain the video features.
Therefore, the time sequence relation and the motion relation among the multi-frame images in the video data can be captured better through the 3D convolutional neural network model, and the accuracy of video feature extraction is improved.
Specifically, in one embodiment, video processing operations such as clipping and framing are performed on the video data to obtain video images of a local area. The local-area video images are clipped and segmented to obtain a plurality of continuous local-area image frames, which are input into a 3D convolutional neural network model. By adding time-dimension information, more expressive features can be extracted: the 3D convolutional neural network model can resolve the correlation information between multiple images, takes continuous multi-frame images as input, and captures the motion information in the input frames through the added dimension, so that the image features of the video images can be better obtained.
After the multi-source heterogeneous data is acquired, a data feature of the multi-source heterogeneous data may be enhanced using an attention mechanism model. Preferably, the above-described operations may be performed on voice data to increase the speed and effect of the conversion. Based on this, the present invention may further comprise the steps of: based on the attention mechanism model, speech features of the speech data are enhanced. It will be appreciated that this step may be located anywhere after "acquire multi-source heterogeneous data" to achieve a corresponding effect.
In a preferred embodiment, the solution may also be combined with the embodiment comprising step S01A and step S02A, where the step of enhancing the speech feature using the attention mechanism model may be specifically arranged after step S02A. In addition, the above steps are not limited to the scenario where the multi-source heterogeneous data includes voice data, and the present invention does not exclude the process of implementing the attention mechanism model on other kinds of multi-source heterogeneous data.
Therefore, by introducing an attention mechanism, the multi-source heterogeneous data or the voice data can be focused on the related characteristic or weight information converted into the first text information, and the irrelevant characteristic or weight information is ignored, so that the speed and effect of converting the voice into the text can be further improved, and the calculation time and cost are reduced.
Specifically, an attention mechanism is introduced into the multi-size multi-channel convolutional neural network model, so that the neural network model can autonomously learn the attention mechanism, attend to key or related features of the voice data, and ignore other non-key or unrelated feature information, thereby enhancing the voice features. The attention mechanism can in turn help to interpret the neural network; the two complement each other, improving the efficiency and effect of converting voice into text.
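For illustration only, a minimal scaled dot-product attention over small Python lists shows how weights concentrate on related features while unrelated ones are down-weighted; the function names and the tiny example vectors are assumptions, not the model's actual architecture:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(query, keys, values):
    # Scaled dot-product attention: weight each value by the similarity
    # between the query and its key, so relevant features dominate.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    out = [sum(w * v[i] for w, v in zip(weights, values))
           for i in range(len(values[0]))]
    return out, weights
```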
In other embodiments of the present invention, the enhancement of the speech features of the speech data may be accomplished by other algorithm models, which the present invention is not limited to.
The multi-size multi-channel convolutional neural network model in step S02A is composed of multi-layer deep convolutional neural networks, generally comprising a plurality of convolutional layers plus a plurality of fully-connected layers, and may further comprise nonlinear operations, pooling operations and the like. The model extracts the voice features of the voice data, then calculates text features from the voice features, and converts them into corresponding text information.
Therefore, since the calculation operates on the voice features rather than the raw voice data, the amount of computation is small and local features are easy to characterize. Secondly, the attention mechanism and the pooling operation can give the model better time-domain or frequency-domain invariance, and the deeper nonlinear structure also gives the model strong text characterization capability. The distillation diffusion model in step S02B is a type of diffusion model by which the underlying data distribution can be learned from a set of data whose distribution is unknown.
In particular, the diffusion model may be divided into a forward diffusion process and a backward diffusion process. The distribution of the data is first systematically disturbed by the forward diffusion process; specifically, Gaussian noise is added to the data according to a pre-designed noise schedule until the distribution of the data converges to a prior, i.e., a standard Gaussian distribution. The distribution of the data is then restored through a learned backward diffusion process; that is, starting from the given prior distribution, the original data distribution is gradually recovered by learning a parameterized Gaussian transition kernel.
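The forward diffusion process described above admits a closed-form sample of x_t given x_0; a minimal sketch follows, where the linear beta schedule and function names are illustrative assumptions:

```python
import math
import random

def forward_diffuse(x0, t, betas, noise=None):
    # Sample x_t ~ q(x_t | x_0):
    #   x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
    # where alpha_bar_t is the cumulative product of (1 - beta_i).
    alpha_bar = 1.0
    for beta in betas[:t]:
        alpha_bar *= (1.0 - beta)
    if noise is None:
        noise = [random.gauss(0.0, 1.0) for _ in x0]
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, noise)]
```

As t grows, alpha_bar shrinks toward zero and x_t approaches pure standard Gaussian noise, which is the prior the backward process starts from.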
Further, given a trained instruction model, i.e., a teacher model, the distillation diffusion model may include two steps, in a first step, introducing a continuous-time student model to match the combined output of the two teacher diffusion models to obtain a distillation model; in the second step, the distillation model trained in the first step is gradually converted into a model with fewer steps. Of course, other distillation algorithm models may be employed, without specific limitation.
In practical application, because people speak at different speeds, videos are shot at different frame rates, and character spacing differs, it is difficult to align the multi-source heterogeneous data and the first text information at the word level; by adopting the Connectionist Temporal Classification algorithm, the mapping relation between the multi-source heterogeneous data and the first text information can be constructed.
Specifically, the Connectionist Temporal Classification (CTC) algorithm is an end-to-end training algorithm that allows the network model to automatically learn the alignment between input and output sequence lengths. Preferably, the CTC algorithm may be applied after the convolutional network model. The algorithm does not require label alignment or annotation of the input data in advance; instead, it trains on the input and output sequences and outputs a mapping relation.
In one embodiment of the invention, the algorithm aligns the sequence length of the data feature (input sequence) and the first text information (output sequence), establishes a mapping relation between the multi-source heterogeneous data and the first text information, and can infer the most probable text content according to the mapping relation, thereby improving the accuracy of converting the multi-source heterogeneous data into the text, shortening the conversion time and reducing the calculation complexity.
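A minimal sketch of the CTC decoding rule (merge repeated symbols, then drop blanks) illustrates how an unaligned frame-level path collapses to output text; the blank symbol and function name are assumptions:

```python
BLANK = "-"  # the CTC blank symbol (assumed representation)

def ctc_collapse(path):
    # Collapse a frame-level CTC path into the output text:
    # merge consecutive repeated symbols, then remove blanks.
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)
```

This many-to-one collapse is why CTC needs no pre-aligned labels: many frame-level paths map to the same output sequence, and training sums over all of them.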
It should be noted that the speech feature in step S02A may include Mel-scale Frequency Cepstral Coefficients (MFCC). The MFCC is an identifiable component of the voice signal and is a cepstral parameter extracted in the Mel-scale frequency domain. The preprocessed voice data is Fourier-transformed to obtain the energy spectrum of the multi-frame voice signal; the energy spectrum is smoothed by a group of Mel-scale triangular filter banks to eliminate harmonics; and the MFCC features are then obtained after a logarithmic operation and a discrete cosine transform. The Mel scale describes the non-linear frequency characteristics of the human ear, and the MFCC parameters take into account how the human ear perceives different frequencies, which makes them particularly suitable for speech recognition and speaker recognition.
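The Mel-scale conversion underlying the MFCC pipeline can be sketched as follows, using the standard 2595/700 constants; the filter-center helper is an illustrative assumption, and a full MFCC pipeline would add the FFT, filterbank, log and DCT steps:

```python
import math

def hz_to_mel(f):
    # Mel scale models the non-linear frequency perception of the human ear.
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_centers(n_filters, f_min, f_max):
    # Triangular filter centers are spaced evenly on the mel scale, not in Hz,
    # so the filters widen toward higher frequencies.
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]
```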
It should be emphasized that the speech feature may also use a perceptual linear prediction feature (Perceptual Linear Predictive, abbreviated PLP) or a linear prediction coefficient feature (Linear Predictive Coding, abbreviated LPC) instead of the MFCC feature, and may be specifically selected based on the actual model parameters and the field of application of the method, which is not specifically limited in the present invention.
In order to facilitate understanding of the text data, after obtaining the text data, the present invention performs a labeling operation on the text data, specifically, as shown in fig. 3, step S1 may specifically include:
and S11, performing pre-labeling on the text data based on a preset labeling rule to obtain first labeled text information, wherein the preset labeling rule characterizes labeling information corresponding to a plurality of professional field words and/or words with special meanings in the text data.
And step S12, executing text feature annotation on the first annotation text information to obtain the annotation text, wherein the text feature annotation comprises at least one of entity annotation, relation annotation, intention annotation, grammar annotation and emotion annotation.
And S13, performing annotation quality inspection on the annotation text, screening and storing the annotation text meeting the preset condition.
Therefore, the text data is labeled in terms of semantics, structure, context, intention, emotion and the like by executing pre-labeling, text feature labeling and labeling quality inspection operations on the text data, so that a huge text data set is created, subsequent slot extraction and language understanding model training are facilitated, and the accuracy of semantic understanding of the text data is improved.
For step S11, the preset labeling rules may specifically refer to a rule set including a plurality of preset labeling rules, which may be stored in a file or dictionary format. The preset labeling rules may target specific text content that ordinary labeling rules cannot handle. For example, when a sentence containing "casadi" is labeled using a traditional labeling tool, word segmentation may split the name into three separate parts; but the word is a refrigerator brand name, i.e., a special text. When text labeling is performed on such words, no word segmentation is needed, and the pre-labeling operation can be executed directly according to the preset labeling rules, thereby avoiding labeling errors on special texts in professional fields.
For step S12, the labeling may be classified into online labeling and offline labeling according to the requirements of the project. In particular, online labeling may refer to uploading the text data to a data processing platform and performing the labeling operation over the internet. Offline labeling may refer to performing the labeling operation with an offline tool or offline files (e.g., Excel, txt, etc.).
The text feature labels may include at least one of entity labels, relationship labels, intent labels, grammar labels, and emotion labels. Specifically, entity labeling may refer to extracting entities from the text data for labeling; relationship labeling may annotate the syntactic and semantic associations of complex sentences; intent labeling may refer to labeling the intention or purpose of the text data, including request, command, reservation, recommendation and the like; emotion labeling may refer to marking keywords or key phrases in the text data that are favored, sensitive, etc., by determining the emotion contained in the text, for example with three-level emotion labels (positive, neutral, negative).
Further, as shown in fig. 4, the method of the present invention refines the portion of step S13 in which labeling quality inspection is executed on the labeled text and the labeled text meeting the preset condition is screened and stored, which may specifically include:
Step S131, performing statistics to judge whether the accuracy of the text labels is greater than or equal to a preset threshold.
If yes, jump to step S132, and save the labeled text to a text labeling database.
If not, jump to step S133, and re-execute the pre-labeling, the text feature labeling and the labeling quality inspection on the text data.
The accuracy characterizes the ratio of the number of correct labels to the total number of labels in the labeled text. Specifically, judging the number of correct text labels may include: checking the relevant labeling data in the labeled text against the predefined labeling rules and requirements, and counting the number of labels that conform to those rules.
Therefore, by iteratively executing the pre-labeling, text feature labeling, labeling quality inspection and labeled-text storage operations on the text data, the labeling time is reduced and the labeling efficiency and accuracy of the text data are improved.
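A minimal sketch of the quality-inspection decision in steps S131 through S133, with rule-based correctness checks; the rule representation and the threshold value are assumptions:

```python
def check_label_quality(labels, rules, threshold=0.95):
    # A label (name, value) counts as correct when it satisfies the
    # predefined rule registered for its name.
    correct = sum(1 for name, value in labels
                  if rules.get(name, lambda v: False)(value))
    accuracy = correct / len(labels) if labels else 0.0
    # Save when accuracy >= threshold (S132), otherwise re-label (S133).
    return accuracy, ("save" if accuracy >= threshold else "relabel")
```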
In order to achieve fast slot extraction and matching, the present invention refines the step S2, as shown in fig. 5, the step S2 may specifically include:
and S21, executing extraction operation on each slot according to the slot marking information in the marking text to obtain a plurality of slot information.
Step S22, querying the preset slot database according to the plurality of pieces of slot information, and judging whether each piece of slot information matches at least one query result.
If yes, jump to step S23 to obtain a matching result corresponding to each piece of slot information.
If not, jump to step S24: a natural language understanding model is built according to the labeled text, and text feature prediction is executed on the labeled text based on the natural language understanding model, so as to obtain a plurality of matching results corresponding to the text information.
Therefore, the method is simple and easy to realize, has high matching speed and high accuracy, and is beneficial to improving the accuracy of slot extraction based on the annotation text and carrying out slot matching inquiry based on a preset slot database; in addition, a training model is built based on the labeling text, so that the automation degree of the subsequent slot extraction and matching is high, and the error is small.
The slot information is an abstraction of the relevant labeling information in the text data. For example, when a user asks "What is the weather like in Beijing today?", slot information such as "location" and "date" can be abstracted from "Beijing" and "today", and slot extraction can be performed on this information.
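A minimal lexicon-lookup sketch of slot extraction on an example like the one above; the lexicon contents and function names are assumptions, not the invention's actual extraction method:

```python
def extract_slots(tokens, slot_lexicon):
    # Map each token to a slot name via a lexicon lookup, collecting
    # one value per slot (a deliberately simple extraction scheme).
    slots = {}
    for tok in tokens:
        slot = slot_lexicon.get(tok)
        if slot:
            slots[slot] = tok
    return slots

# Illustrative slot lexicon for the weather example.
LEXICON = {"Beijing": "location", "today": "date", "weather": "topic"}
```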
Before the step of "constructing a natural language understanding model according to the labeled text" in step S24, the method may further include:
and carrying out data set division on the marked text by adopting a k-fold cross validation method to obtain a training data set and a test data set.
Therefore, the data set is divided by the cross verification method, each sub-sample is guaranteed to participate in training and is tested, generalization errors are reduced, the occurrence of over-fitting and under-fitting states can be effectively avoided, the cross thought is fully reflected, and the data set dividing result is more persuasive and has better stability.
In the k-fold cross validation method, the text data is divided into k data subsets of equal size; the k subsets are traversed in turn, each time taking the current subset as the validation data set and the rest as the training data set, and the model is trained and evaluated; finally, the average of the k evaluation indices is taken as the final evaluation index. k is generally 10 and can be adjusted adaptively to the actual situation, which the present invention does not specifically limit.
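The k-fold split can be sketched in a few lines of Python; the interleaved fold assignment is one possible choice and is an assumption:

```python
def k_fold_splits(samples, k=10):
    # Yield (train, validation) pairs; each sample appears in exactly one
    # validation fold across the k rounds, so every sample is both trained
    # on and tested.
    folds = [samples[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, val
```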
When the quick matching of the slot information fails, the method can utilize the labeling text to construct a training model. Specifically, as shown in fig. 6, step S24 may specifically include:
And S241, preprocessing the training data set, and extracting text characteristics of the preprocessed training data to obtain initial characteristics of each marked text in the training data set.
Step S242, executing pooling operation and full-connection combination calculation on the initial characteristics to obtain text characteristics corresponding to the labeling text;
step S243, inputting the text feature to a pre-training language model to perform model training, so as to obtain the natural language understanding model.
Step S244, inputting the test data set into the natural language understanding model, and predicting text characteristics corresponding to each labeling text in the test data to obtain a plurality of matching results corresponding to the text information.
Thus, the natural language understanding model is built based on the marked text data, so that the time and cost for processing training sample data are reduced, and the model is more close to the actual situation; and the slot position extraction and the slot position information matching are realized by using the trained model, the slot position extraction and the matching are rapid, convenient, high in intelligent degree, high in efficiency and accuracy, and strong in universality and popularization.
The preprocessing specifically comprises standardizing the numerical features in the training data set to eliminate dimension and unify the value ranges of different feature items, for example using the Min-Max or Z-Score method; checking for missing and repeated values; and encoding non-numerical features, so that subsequent text feature extraction is more accurate. The "performing pooling operation and full-connection combination calculation on the initial features" in step S242 may specifically include:
Converting the data of local areas of the initial features into a plurality of first matrices; splicing the first matrices by rows to obtain a second matrix; traversing the second matrix and taking the maximum value of each row of elements in turn to obtain a pooled feature matrix; and inputting the pooled feature matrix into a fully-connected layer, calculating the score of the text feature information of the labeled text, and screening out the text features corresponding to the labeled text according to the score. Thus, the pooling operation reduces the spatial size of each initial feature input, which reduces the amount of computation and the consumption of computing resources and avoids the overfitting problem caused by full-connection calculation.
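The row-wise pooling and fully-connected scoring described above can be sketched as follows; the weight vector and helper names are illustrative assumptions:

```python
def max_pool_rows(matrix):
    # Traverse the matrix and keep the maximum of each row (pooling step).
    return [max(row) for row in matrix]

def fully_connected(features, weights, bias=0.0):
    # One fully-connected unit producing a score for the labeled text.
    return sum(f * w for f, w in zip(features, weights)) + bias

def text_feature_score(region_matrices, weights):
    # Splice the per-region matrices by rows, pool, then score.
    stacked = [row for m in region_matrices for row in m]
    pooled = max_pool_rows(stacked)
    return fully_connected(pooled, weights)
```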
Methods for constructing the natural language understanding model include, but are not limited to: machine learning methods, deep learning methods, knowledge graph methods, fusion and integration methods, and large language model methods.
The large language model is a natural language processing technology based on machine learning, and a model capable of understanding human language and automatically generating language is built through learning of a large-scale corpus. This model may be used to accomplish a variety of natural language processing tasks such as automatic question-answering, machine translation, speech recognition, text generation, and the like. The core idea of the large language model is to train a massive corpus by using a deep learning algorithm, so that a machine can learn the rules and characteristics of human language, and further, the automation of natural language processing is realized.
For the "identifying and fusing the plurality of matching results to obtain the understanding result information corresponding to the text data" in step S3, the method specifically may include: based on a dictionary or a preset mapping file, semantic text information corresponding to each matching result is obtained, and splicing is carried out on the semantic text information, so that understanding result information corresponding to the text data is obtained.
In this way, the semantic text information is stored in the form of files or dictionaries, and the method is simple, easy to configure and manage, and convenient for post maintenance of the items and reusability of new item related programs or files. In addition, the semantic text information is spliced, so that a user can more accurately know the original meaning expressed by the text data.
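A minimal sketch of the dictionary-based lookup and splicing of matching results described above; the mapping contents and names are assumptions:

```python
# Preset mapping from matching results to semantic text (assumed contents);
# in practice this could be loaded from a file or dictionary.
SEMANTIC_MAP = {
    ("topic", "weather"): "the weather",
    ("location", "Beijing"): "in Beijing",
    ("date", "today"): "today",
}

def fuse_results(matching_results):
    # Look up the semantic text for each matching result and splice the
    # pieces together into the understanding result information.
    parts = [SEMANTIC_MAP[r] for r in matching_results if r in SEMANTIC_MAP]
    return " ".join(parts)
```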
To facilitate the user to obtain the text representation information, the method may further include: and converting the understanding result information corresponding to the text data into voice for output, and/or converting the understanding result information corresponding to the text data into voice for transmission to a client for output, and/or converting the understanding result information corresponding to the text data into text for transmission to the client.
Therefore, by matching the multi-channel multi-mode data source acquisition mode provided by the invention, a user can directly interact with the intelligent device remotely, the method has extremely high convenience and greatly improves the user experience. In other embodiments of the present invention, only one or more of the above-described output modes of the understanding result information may be used, or the understanding result information may be output through other channels based on the prior art, which is not particularly limited by the present invention.
The various embodiments, examples or specific examples provided herein may be combined with one another to ultimately form a plurality of preferred embodiments.
For example, a schematic block diagram of a natural language understanding method is shown in fig. 7 in a preferred embodiment. Fig. 8 correspondingly shows the conversion process of related multi-source heterogeneous data or text data involved in the natural language understanding method when the preferred embodiment is executed. The processing of the preferred embodiment will be summarized below in connection with fig. 7 and 8.
First, multi-source heterogeneous data is acquired through multiple channels to obtain multiple types of data 101. Wherein the plurality of types of data 101 specifically include text data, voice data, and video data; the plurality of types of data 101 are subjected to text transcription operations, such as performing data preprocessing, feature extraction, and the like on voice and/or video data, and are transcribed into corresponding text contents to generate corresponding text data 102.
Labeling operation is performed on the text data 102 generated after transcription, corresponding labels are marked according to text features of the text data to realize text labeling, and the operations of pre-labeling, text feature labeling, labeling quality inspection, labeling text storage and the like can be performed on the text data to obtain labeled text 103.
And executing slot extraction operation on the marked text 103 to obtain a plurality of pieces of slot information 104.
A slot information query operation is executed on the extracted slot information based on a preset slot database, and whether each piece of slot information matches at least one query result is judged; if not, a natural language understanding model is constructed, and slot extraction and slot matching are executed on the labeled text based on the natural language understanding model, so as to obtain a plurality of matching results 105 corresponding to the text data.
And calculating a fusion result according to the plurality of matching results 105 to obtain the understanding result information 106.
The present invention also provides a refrigerator including: a memory for storing executable instructions; and the processor is used for realizing the steps of any natural language understanding method when executing the executable instructions stored in the memory.
In summary, the multi-source heterogeneous data is obtained through multiple channels, the multi-source heterogeneous data is subjected to transcription operation, converted into text data and subjected to text labeling, and labeled text corresponding to the text data is obtained; and then executing slot extraction and slot information matching operation on the marked text to obtain a plurality of matching results corresponding to the text data, and identifying and fusing the plurality of matching results to generate understanding result information corresponding to the text data. The method is based on the preset slot database for matching, so that the matching result can be obtained quickly and efficiently, data sharing can be realized, and the redundancy of the matching result can be reduced, so that the safety and reliability of the data can be ensured; and by performing splicing on a plurality of matching results, the output understanding result information can give consideration to information in multiple aspects such as accuracy of contents on entities, relations among entities, emotion expression among entities and the like, thereby being beneficial to enhancing understanding ability of natural language, improving accuracy of the understanding result information and improving user experience effect.
In addition, the labeling text can be further used as a training data set and a testing data set for constructing the language understanding model, so that the cost of relevant data processing such as text labeling is reduced, the model training efficiency is improved, the labeling text is subjected to slot extraction and slot information matching by using the model, and the matching result has strong rationality, high efficiency, high accuracy and high reliability.
It should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. This manner of description is adopted for clarity only; those skilled in the art should take the specification as a whole, and the technical solutions in the embodiments may be suitably combined to form other embodiments that can be understood by those skilled in the art.
The above list of detailed descriptions is only specific to practical embodiments of the present invention, and they are not intended to limit the scope of the present invention, and all equivalent embodiments or modifications that do not depart from the spirit of the present invention should be included in the scope of the present invention.

Claims (16)

1. A natural language understanding method, comprising:
acquiring a labeling text corresponding to the text data;
performing slot extraction on the marked text, and performing matching on a plurality of pieces of slot information obtained by extraction based on a preset slot database to obtain a plurality of matching results corresponding to the text data;
identifying and fusing the plurality of matching results to obtain understanding result information corresponding to the text data;
wherein the matching results characterize features of a plurality of elements in the text data in terms of word attributes and/or word meanings, and the understanding result information characterizes a text category, an emotional connotation and/or an internal intention of the text data as a whole.
2. The natural language understanding method of claim 1, wherein the text data comprises voice text data, video text data, and text data.
3. The natural language understanding method according to claim 1, wherein before the "obtaining the markup text corresponding to the text data", the method further comprises:
acquiring multi-source heterogeneous data and extracting data characteristics of the multi-source heterogeneous data; wherein the multi-source heterogeneous data includes at least one of voice data and video data;
inputting the data characteristics into a convolutional neural network model to obtain corresponding first text information;
and aligning the sequence lengths of the first text information and the data features by using a connection time sequence classification algorithm, and executing full connection combination calculation on the first text information to obtain the text data.
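Claim 3 invokes a connection timing classification (CTC) algorithm to align frame-level model output with text length. As an illustrative sketch only, the core CTC decoding step — merging repeated frame labels and removing blanks — can be written as follows (the function name and blank symbol are assumptions):

```python
def ctc_collapse(frame_labels, blank="-"):
    # Connectionist temporal classification decoding: merge repeated frame
    # labels, then drop blanks, so the frame-level sequence length is
    # aligned with the length of the output text.
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return "".join(out)

print(ctc_collapse(["h", "h", "-", "e", "l", "-", "l", "o", "o"]))  # hello
```

Note that the blank symbol is what allows genuinely repeated characters (the double "l" above) to survive the merge.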
4. A natural language understanding method according to claim 3, wherein said inputting the data features into a convolutional neural network model to obtain the corresponding first text information comprises:
And inputting the data features into a multi-size multi-channel convolutional neural network model or a distillation diffusion model to obtain first text information conforming to the data features.
5. The natural language understanding method of claim 4, wherein prior to the extracting the data features of the multi-source heterogeneous data, the method further comprises:
judging whether the multi-source heterogeneous data comprises voice data or not;
the step of inputting the data features into a convolutional neural network model to obtain corresponding first text information specifically includes:
if yes, inputting the voice characteristics into the multi-size multi-channel convolutional neural network model to obtain first text information corresponding to the voice data; wherein the data features comprise the speech features;
and/or, prior to the "extracting the data features of the multi-source heterogeneous data", the method further comprises:
judging whether the multi-source heterogeneous data comprises video data or not;
the step of inputting the data features into a convolutional neural network model to obtain corresponding first text information specifically includes:
if yes, inputting the video features into the distillation diffusion model to obtain first text information corresponding to the video data; wherein the data features comprise the video features.
6. The natural language understanding method of claim 5, further comprising:
and inputting the video data into a 3D depth convolution neural network model to obtain the video features.
7. The natural language understanding method according to claim 1, wherein the step of acquiring the markup text corresponding to the text data comprises:
performing pre-labeling on the text data based on a preset labeling rule to obtain first labeled text information, wherein the preset labeling rule characterizes labeling information corresponding to a plurality of professional field words and/or words with special meanings in the text data;
executing text feature annotation on the first annotation text information to obtain the annotation text, wherein the text feature annotation comprises at least one of entity annotation, relation annotation, intention annotation, grammar annotation and emotion annotation;
and executing annotation quality inspection on the annotation text, screening and storing the annotation text meeting the preset condition.
8. The natural language understanding method according to claim 7, wherein the step of performing a labeling quality check on the labeling text, and screening and storing the labeling text satisfying a preset condition specifically comprises:
Counting and judging whether the accuracy of the text labels is greater than or equal to a preset threshold, wherein the accuracy characterizes the ratio of the correct label number to the total label number in the label text;
if yes, the marked text is saved to a text marked database;
and if not, re-executing the pre-labeling, the text feature labeling and the labeling quality inspection on the text data.
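The quality check of claim 8 reduces to comparing a label-accuracy ratio against a preset threshold. A minimal sketch, assuming a hypothetical threshold of 0.95 (the patent does not specify one):

```python
def passes_quality_check(correct_labels, total_labels, threshold=0.95):
    # Accuracy = correct label count / total label count; labeled texts
    # below the threshold are re-labeled instead of being stored in the
    # text labeling database.
    return total_labels > 0 and correct_labels / total_labels >= threshold

print(passes_quality_check(98, 100))  # True: store to the labeling database
print(passes_quality_check(90, 100))  # False: re-execute labeling
```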
9. The natural language understanding method according to claim 1, wherein the step of performing slot extraction on the markup text and performing matching on a plurality of slot information obtained by the extraction based on a preset slot database to obtain a plurality of matching results corresponding to the text information specifically comprises:
according to the slot marking information in the marking text, extracting each slot to obtain a plurality of slot information;
inquiring the preset slot database according to the plurality of slot information, and judging whether each piece of slot information is matched with at least one piece of inquiry result;
if yes, a matching result corresponding to each slot position information is obtained;
if not, constructing a natural language understanding model according to the marked text, and executing text feature prediction on the marked text based on the natural language model to obtain a plurality of matching results corresponding to the text information.
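The branch in claim 9 — database lookup first, model-based prediction only when some slot has no query result — can be sketched as follows; the slot database contents and the stand-in prediction function are illustrative assumptions:

```python
SLOT_DB = {"temperature": "device-setting", "milk": "food-entity"}

def model_predict(slots):
    # Stand-in for the trained natural language understanding model that
    # performs text feature prediction when the database query fails.
    return [f"predicted:{s}" for s in slots]

def match_or_predict(slots, slot_db=SLOT_DB):
    # Per claim 9: query the preset slot database for each slot; fall back
    # to model-based prediction if any slot has no query result.
    if all(s in slot_db for s in slots):
        return [slot_db[s] for s in slots]
    return model_predict(slots)

print(match_or_predict(["milk", "temperature"]))
# ['food-entity', 'device-setting']
print(match_or_predict(["milk", "humidity"]))
# ['predicted:milk', 'predicted:humidity']
```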
10. The natural language understanding method of claim 9, wherein prior to the step of constructing a natural language understanding model from the markup text, the method further comprises:
and carrying out data set division on the marked text by adopting a k-fold cross validation method to obtain a training data set and a test data set.
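Claim 10 divides the labeled text by k-fold cross validation: each fold serves once as the test set while the remaining folds form the training set. A self-contained sketch (fold count and seed are illustrative):

```python
import random

def k_fold_split(samples, k=5, seed=0):
    # k-fold cross validation: shuffle indices once, slice them into k
    # folds, and yield (training set, test set) pairs where each fold is
    # the test set exactly once.
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = [samples[j] for j in folds[i]]
        train = [samples[j] for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

data = [f"text{i}" for i in range(10)]
splits = list(k_fold_split(data, k=5))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 5 8 2
```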
11. The method according to claim 10, wherein the steps of constructing a natural language understanding model from the markup text and performing text feature prediction on the markup text based on the natural language model, and obtaining a plurality of matching results corresponding to the text information include:
preprocessing the training data set, and extracting text features of the preprocessed training data to obtain initial features of each labeling text in the training data set;
executing pooling operation and full-connection combined calculation on the initial characteristics to obtain text characteristics corresponding to the labeling text;
inputting the text features into a pre-training language model to perform model training to obtain the natural language understanding model;
inputting the test data set into the natural language understanding model, and predicting text characteristics corresponding to each labeling text in the test data to obtain a plurality of matching results corresponding to the text information.
12. The natural language understanding method of claim 11, wherein the step of obtaining the text feature corresponding to the markup text by performing a pooling operation and a full-connection combined calculation on the initial feature specifically comprises:
converting the local area data of the initial feature into a plurality of first matrixes;
performing splicing operation on the first matrixes according to rows to obtain a second matrix;
traversing the second matrix and sequentially obtaining the maximum value of each row of elements in the matrix to obtain a pooling feature matrix;
and inputting the pooled feature matrix to a full-connection layer, calculating to obtain the score of the text feature information of the marked text, and screening to obtain the text feature corresponding to the marked text according to the score.
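The pooling operation of claim 12 splices the first matrices by rows into a second matrix and then takes the maximum of each row. A minimal sketch with toy matrices (the fully-connected scoring step is omitted):

```python
def row_max_pool(first_matrices):
    # Splice the first matrices by rows into one second matrix, then
    # traverse it taking the maximum of each row to obtain the pooled
    # feature vector (claim 12, before the fully-connected layer).
    second = [row for m in first_matrices for row in m]  # row-wise splice
    return [max(row) for row in second]

m1 = [[1.0, 3.0], [2.0, 0.5]]
m2 = [[4.0, 1.0]]
print(row_max_pool([m1, m2]))  # [3.0, 2.0, 4.0]
```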
13. The natural language understanding method of claim 9, wherein the method of constructing the natural language understanding model comprises: machine learning methods, deep learning methods, knowledge-graph methods, fusion and integration methods, and large language model methods.
14. The natural language understanding method according to claim 1, wherein the step of identifying and fusing the plurality of matching results to obtain the understanding result information corresponding to the text data specifically comprises:
Based on a dictionary or a preset mapping file, semantic text information corresponding to each matching result is obtained, and splicing is carried out on the semantic text information, so that understanding result information corresponding to the text data is obtained.
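The fusion step of claim 14 maps each matching result to semantic text via a dictionary or preset mapping file and splices the pieces together. A sketch with a hypothetical mapping (contents and joiner are assumptions):

```python
# Hypothetical stand-in for the preset mapping file: matching result ->
# semantic text information.
MAPPING = {
    ("action", "store"): "store it",
    ("food", "apple"): "an apple (food item)",
}

def fuse_results(match_results):
    # Look up the semantic text for each matching result and splice the
    # pieces into one understanding-result string, preserving order.
    return ", ".join(MAPPING[r] for r in match_results if r in MAPPING)

print(fuse_results([("action", "store"), ("food", "apple")]))
# store it, an apple (food item)
```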
15. The natural language understanding method of claim 1, wherein the method further comprises:
converting the understanding result information corresponding to the text data into voice for output; and/or
converting the understanding result information corresponding to the text data into voice and transmitting the voice to a client for output; and/or
converting the understanding result information corresponding to the text data into text for output; and/or
converting the understanding result information corresponding to the text data into text and transmitting the text to the client for output.
16. A refrigerator, comprising: a memory for storing executable instructions; a processor for implementing the steps of the natural language understanding method of any one of claims 1 to 15 when executing executable instructions stored in said memory.
CN202310292017.8A 2023-03-23 2023-03-23 Natural language understanding method and refrigerator Pending CN116431806A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310292017.8A CN116431806A (en) 2023-03-23 2023-03-23 Natural language understanding method and refrigerator

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310292017.8A CN116431806A (en) 2023-03-23 2023-03-23 Natural language understanding method and refrigerator

Publications (1)

Publication Number Publication Date
CN116431806A true CN116431806A (en) 2023-07-14

Family

ID=87089941

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310292017.8A Pending CN116431806A (en) 2023-03-23 2023-03-23 Natural language understanding method and refrigerator

Country Status (1)

Country Link
CN (1) CN116431806A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116737909A (en) * 2023-07-28 2023-09-12 无锡容智技术有限公司 Table data processing method based on natural language dialogue
CN116737909B (en) * 2023-07-28 2024-04-23 无锡容智技术有限公司 Table data processing method based on natural language dialogue
CN117251559A (en) * 2023-09-20 2023-12-19 广东筑小宝人工智能科技有限公司 Engineering standard specification acquisition method and system based on natural language big model
CN117251559B (en) * 2023-09-20 2024-04-26 广东筑小宝人工智能科技有限公司 Engineering standard specification acquisition method and system based on natural language big model

Similar Documents

Publication Publication Date Title
CN111933129B (en) Audio processing method, language model training method and device and computer equipment
CN110377911B (en) Method and device for identifying intention under dialog framework
CN107657017B (en) Method and apparatus for providing voice service
CN110853649A (en) Label extraction method, system, device and medium based on intelligent voice technology
CN116431806A (en) Natural language understanding method and refrigerator
WO2023222088A1 (en) Voice recognition and classification method and apparatus
CN114580382A (en) Text error correction method and device
CN111489765A (en) Telephone traffic service quality inspection method based on intelligent voice technology
CN111489743B (en) Operation management analysis system based on intelligent voice technology
Braunschweiler et al. Factors in emotion recognition with deep learning models using speech and text on multiple corpora
WO2023222090A1 (en) Information pushing method and apparatus based on deep learning
CN114120985A (en) Pacifying interaction method, system and equipment of intelligent voice terminal and storage medium
CN114328817A (en) Text processing method and device
CN116186258A (en) Text classification method, equipment and storage medium based on multi-mode knowledge graph
Hassan et al. Improvement in automatic speech recognition of south asian accent using transfer learning of deepspeech2
CN117037789B (en) Customer service voice recognition method and device, computer equipment and storage medium
US11615787B2 (en) Dialogue system and method of controlling the same
CN113393841B (en) Training method, device, equipment and storage medium of voice recognition model
CN117150338A (en) Task processing, automatic question and answer and multimedia data identification model training method
CN116705077A (en) Method for analyzing speech recognition emotion state by AI
CN116361442A (en) Business hall data analysis method and system based on artificial intelligence
de Oliveira et al. Leveraging semantic information for efficient self-supervised emotion recognition with audio-textual distilled models
Yu et al. Incorporating multimodal sentiments into conversational bots for service requirement elicitation
CN115376547A (en) Pronunciation evaluation method and device, computer equipment and storage medium
CN112506405B (en) Artificial intelligent voice large screen command method based on Internet supervision field

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination