WO2023222089A1 - Item classification method and apparatus based on deep learning - Google Patents


Info

Publication number
WO2023222089A1
WO2023222089A1 (PCT/CN2023/095081)
Authority
WO
WIPO (PCT)
Prior art keywords: text, data, historical, text data, features
Application number
PCT/CN2023/095081
Other languages
French (fr)
Chinese (zh)
Inventor
曾谁飞
孔令磊
张景瑞
刘卫强
李敏
Original Assignee
青岛海尔电冰箱有限公司
海尔智家股份有限公司
Application filed by 青岛海尔电冰箱有限公司 and 海尔智家股份有限公司
Publication of WO2023222089A1

Classifications

    • G10L 15/16: Speech recognition; speech classification or search using artificial neural networks
    • G06F 18/253: Pattern recognition; analysing; fusion techniques of extracted features
    • G06F 40/284: Handling natural language data; natural language analysis; lexical analysis, e.g. tokenisation or collocates
    • G06F 40/30: Handling natural language data; semantic analysis
    • G06N 3/049: Neural networks; architecture; temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/08: Neural networks; learning methods
    • G10L 15/02: Speech recognition; feature extraction for speech recognition; selection of recognition unit
    • G10L 15/26: Speech recognition; speech-to-text systems
    • G10L 25/24: Speech or voice analysis techniques not restricted to groups G10L 15/00-G10L 21/00; extracted parameters being the cepstrum
    • G10L 25/30: Speech or voice analysis techniques not restricted to groups G10L 15/00-G10L 21/00; analysis technique using neural networks

Definitions

  • the present invention relates to the field of computer technology, and in particular to an item classification method and device based on deep learning.
  • the purpose of the present invention is to provide an item classification method and device based on deep learning.
  • the present invention provides an item classification method based on deep learning, including the steps:
  • after the joint features are combined through the fully connected layer, they are output to the classifier to calculate scores, obtain classification result information, and determine the item category information;
  • the acquisition of historical text data specifically includes:
  • transcribing the real-time speech data into speech text data and extracting the text features of the speech text data specifically includes:
  • the first speech text vector is input into a speech recognition convolutional neural network for encoding to obtain a second speech text vector.
  • the extraction of real-time voice data features specifically includes:
  • extracting text features of the historical text data specifically includes:
  • the food material review word vector is input into a bidirectional long short-term memory network model to obtain a food material review context feature vector containing contextual feature information based on the historical food material review text data.
  • the text features of the speech text data and the historical ingredient review text data are enhanced.
  • the text features of the speech text data and historical ingredient review text data are enhanced, specifically including:
  • Obtain a voice text attention feature vector that includes the weight information of the voice text data itself and the weight information between the voice text data and the historical ingredient review text data;
  • obtain the food review text attention feature vector, which includes the weight information of the historical food review text data itself and the weight information between the historical food review text context feature vector and the voice text data.
  • the joint representation of the text features of the real-time speech data and the text features of the historical text data to obtain a joint feature vector specifically includes:
  • the voice text attention feature vector and the food review text attention feature vector are jointly mapped to a unified multi-modal vector space for joint representation to obtain the joint feature vector.
  • the text features are output to a classifier to calculate scores to obtain classification result information, which specifically includes:
  • after the joint feature vector is combined through the fully connected layer, it is output to the Softmax function, which calculates the semantic scores of the speech text data and the historical food review text data and their normalized score results to obtain the classification result information.
  • obtaining real-time voice data containing item information specifically includes:
  • the real-time voice data transmitted from the client terminal is obtained.
  • the acquisition of historical ingredient review text data as the historical text data specifically includes:
  • preprocessing the real-time voice data includes: framing and windowing the real-time voice data;
  • preprocessing the historical text data includes: cleaning, annotating, word-segmenting, and removing stop words from the speech text data.
  • the outputting the item category information includes:
  • the step of transcribing the real-time voice data into voice text data, extracting text features of the voice text data, and extracting text features of the historical text data also includes:
  • obtain the configuration data stored in the external cache, perform deep neural network calculations on the real-time voice data and the historical food review text data based on that configuration data, and carry out text transcription and text feature extraction.
  • the present invention also provides an item classification device based on deep learning, including:
  • a data acquisition module, used to acquire real-time voice data and historical text data;
  • a transcription module, used to transcribe the real-time voice data into voice text data;
  • a feature extraction module, used to extract text features of the speech text data and text features of the historical text data;
  • a joint representation module, used to jointly represent the text features of the real-time speech data and the text features of the historical text data to obtain joint features;
  • a result calculation module, used to combine the joint features through the fully connected layer, output them to the classifier to calculate scores and obtain the classification result information, and determine the item category information;
  • an output module, used to output the item category information.
  • the method provided by the present invention completes the task of recognizing and classifying the acquired voice data; by obtaining historical food material review text data and using it as part of the data set for pre-training and prediction models, text semantic feature information is obtained more comprehensively.
  • the historical ingredient review text data serves as supplementary data that compensates for the limited semantic information in the transcribed voice text, effectively improving text classification accuracy and thereby the accuracy of item classification.
  • building a network model that fuses a deep neural network with a convolutional neural network improves the accuracy of real-time speech recognition; building a neural network model that combines a context information mechanism, a self-attention mechanism, and a mutual attention mechanism extracts text semantic feature information more fully.
  • the overall model structure has excellent deep learning representation capability, high speech recognition accuracy, and high accuracy in classifying speech text, greatly improving the accuracy and generalization ability of item category classification.
  • Figure 1 is a structural block diagram of a model involved in an item classification method based on deep learning in an embodiment of the present invention.
  • Figure 2 is a schematic diagram of the steps of an item classification method based on deep learning in an embodiment of the present invention.
  • Figure 3 is a schematic diagram of the steps of acquiring real-time voice data and acquiring historical text data in an embodiment of the present invention.
  • Figure 4 is a schematic diagram of the steps of transcribing the real-time voice data into voice text data and extracting text features of the voice text data in an embodiment of the present invention.
  • Figure 5 is a schematic diagram of the steps of extracting text features of the historical text data in an embodiment of the invention.
  • Figure 6 is a schematic structural diagram of an item classification device based on deep learning in an embodiment of the present invention.
  • As shown in FIG. 1, a structural block diagram of the model involved in the item classification method based on deep learning provided by the present invention.
  • As shown in FIG. 2, a schematic diagram of the steps of the item classification method based on deep learning, which includes:
  • S1: Obtain real-time voice data containing item information, and obtain historical text data.
  • S2: Transcribe the real-time voice data into voice text data, and extract text features of the voice text data.
  • S3: Extract text features of the historical text data.
  • S4: Jointly represent the text features of the real-time speech data and the text features of the historical text data to obtain joint features.
  • S5: Combine the joint features through the fully connected layer, then output them to the classifier to calculate scores and obtain classification result information.
  • S6: Output the item category information.
  • the method provided by the present invention can be used by an intelligent electronic device to implement functions such as real-time interaction or message push with the user based on the user's real-time voice input.
  • a smart refrigerator is taken as an example, and the method is explained in combination with a pre-trained deep learning model.
  • the smart refrigerator classifies the text content corresponding to the user's voice, thereby judging the category of items involved in the voice, and pushing relevant classification information based on the item classification results.
  • the classification of food materials in a smart refrigerator is taken as an example.
  • the method provided by the present invention can also be applied to classifying other items that need to be stored in the refrigerator, such as medicines and cosmetics.
  • in step S1, the method specifically includes:
  • S11: Obtain the real-time voice data collected by a voice collection device, and/or obtain the real-time voice data transmitted from a client terminal.
  • S12: Obtain internally stored, externally stored, and/or client-terminal-transmitted historical food material review text as the historical food material review text data.
  • the real-time voice mentioned here refers to the inquiry or instructional statements currently spoken by the user to the intelligent electronic device or to the client terminal device that is communicatively connected to the intelligent electronic device.
  • in this embodiment, the real-time voice is a sentence containing relevant information such as the category of items stored in the smart refrigerator.
  • the user can ask questions such as "What vegetables are in the refrigerator today" or "What beef ingredients are in the refrigerator today", or issue commands such as "remind me of the types of beverages left in the refrigerator".
  • based on this information, the processor of the smart refrigerator judges the relevant item categories through the method provided by the present invention, and then interacts with the user by real-time voice or pushes relevant information.
  • obtaining historical text data includes:
  • the historical ingredient review text data described here refers to transcribed text of the user's past comments on ingredients, such as "the chili I put in today is very spicy" or "the yogurt of a certain brand I put in yesterday was delicious"; it may further include food review text data entered directly by the user.
  • the historical food review text usually contains item information that the user is interested in. Selecting it as the historical text data can effectively supplement information such as item categories.
  • the user's real-time voice can be collected through voice collection devices such as pickups and microphone arrays installed in the smart refrigerator.
  • during use, when the user needs to interact with the smart refrigerator, speaking to it directly is sufficient.
  • the user's real-time voice can also be obtained through a client terminal connected to the smart refrigerator over a wireless communication protocol.
  • the client terminal is an electronic device with an information sending function, such as a mobile phone, tablet computer, smart speaker, smart bracelet, or Bluetooth headset.
  • the user speaks directly to the client terminal, which collects the voice and transmits it to the smart refrigerator through wireless communication such as Wi-Fi or Bluetooth.
  • when users have interaction needs, they can thus send real-time voice through any convenient channel, which significantly improves convenience of use.
  • one or more of the above real-time voice acquisition methods may be used, or the real-time voice may be acquired through other channels based on existing technology; the present invention places no specific limitation on this.
  • the historical food material review text stored in the internal memory of the smart refrigerator can be read to obtain the historical food material review text data.
  • the historical food material review text data can also be obtained by reading the historical food material review text stored in an external storage device configured for the smart refrigerator.
  • the external storage device is a mobile storage device such as a USB flash drive or SD card; configuring an external storage device further expands the storage space of the smart refrigerator.
  • the historical food review text data stored on a client terminal such as a mobile phone or tablet computer, or on an application software server, can also be obtained.
  • realizing multi-channel historical text acquisition can greatly increase the data volume of historical text information, thereby improving the accuracy of subsequent speech recognition.
  • one or more of the above methods for obtaining historical food review text data may be used, or the data may be obtained through other channels based on existing technology; the present invention places no specific restriction on this.
  • the smart refrigerator is configured with an external cache, and at least part of the historical food material review text data is stored in it. As usage time increases, the historical food material review text data grows; storing part of it in the external cache saves the internal storage space of the smart refrigerator, and directly reading the cached data when performing neural network calculations improves algorithm efficiency.
  • in this embodiment, the Redis component is used as the external cache.
  • Redis is a widely used distributed cache system with a key/value storage structure; it can serve as a database, cache, and message queue.
  • other external caches such as Memcached may be used in other embodiments of the present invention; the present invention places no specific limitation on this.
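As a concrete illustration of this external-cache arrangement, the following is a minimal sketch using the redis-py client; the key naming and JSON serialization are illustrative assumptions, not part of the patented method.

```python
# Minimal sketch of caching historical review text in Redis (redis-py).
# Key names and the JSON serialization scheme are illustrative assumptions.
import json

import redis

cache = redis.Redis(host="localhost", port=6379, db=0)

def cache_reviews(device_id: str, reviews: list[str]) -> None:
    # Store the review list under a per-device key so the neural network
    # pipeline can read it back without touching internal storage.
    cache.set(f"reviews:{device_id}", json.dumps(reviews, ensure_ascii=False))

def load_reviews(device_id: str) -> list[str]:
    raw = cache.get(f"reviews:{device_id}")
    return json.loads(raw) if raw else []
```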
  • through steps S11 and S12, real-time voice data containing item information and historical ingredient review text data can be flexibly obtained through multiple channels, which not only improves the user experience but also ensures sufficient data volume and effectively improves algorithm efficiency.
  • step S1 also includes the step of preprocessing the data, which includes:
  • S13 Preprocess the real-time voice data, including: performing frame processing and windowing processing on the real-time voice data.
  • S14 Preprocess the historical text data, including cleaning, annotating, word segmenting, and removing stop words on the speech text data.
  • in step S13, the speech is segmented according to a specified length (a time period or a number of samples) and structured into a processable data structure, completing the framing of the speech and yielding the speech signal data. The speech signal data is then multiplied by a window function so that the originally non-periodic speech signal exhibits some characteristics of a periodic function, completing the windowing process. Furthermore, pre-emphasis can be performed before framing to emphasize the high-frequency part of the speech, eliminating the influence of lip radiation during voicing, compensating for the high-frequency components of the speech signal suppressed by the articulation system, and highlighting the high-frequency formants.
  • steps such as filtering audio noise and enhancing the vocal signal can also be performed to complete the enhancement of the real-time voice data, extract its characteristic parameters, and make the real-time voice data meet the input requirements of the subsequent neural network models.
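The pre-emphasis, framing, and windowing described above can be sketched with NumPy as follows; the frame parameters (25 ms frames, 10 ms shift, 0.97 pre-emphasis) are typical assumed values, not values prescribed by the patent.

```python
# Sketch of pre-emphasis, framing, and Hamming windowing with NumPy.
import numpy as np

def frame_and_window(signal: np.ndarray, sample_rate: int,
                     frame_ms: float = 25.0, shift_ms: float = 10.0,
                     pre_emphasis: float = 0.97) -> np.ndarray:
    # Pre-emphasis boosts the high-frequency part suppressed by the
    # articulation system and highlights high-frequency formants.
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])
    frame_len = int(sample_rate * frame_ms / 1000)
    frame_shift = int(sample_rate * shift_ms / 1000)
    # Pad so even a short signal yields at least one full frame.
    emphasized = np.pad(emphasized, (0, max(0, frame_len - len(emphasized))))
    n_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    frames = np.stack([emphasized[i * frame_shift: i * frame_shift + frame_len]
                       for i in range(n_frames)])
    # Multiplying by a window function makes each non-periodic frame
    # exhibit some characteristics of a periodic function.
    return frames * np.hamming(frame_len)
```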
  • in step S14, irrelevant and duplicate data in the historical food material review text data set are deleted, outliers and missing values are handled, and information irrelevant to classification is screened out, completing the cleaning of the historical food review text data. The data is then annotated with category labels using rule- and statistics-based methods, and segmented into words using methods such as string-matching-based, understanding-based, statistics-based, or rule-based word segmentation. Finally, stop words are removed, completing the preprocessing of the historical food review text data so that it meets the input requirements of the subsequent neural network model.
  • for the specific algorithms used in steps S13 and S14 to preprocess the real-time voice data and the historical food review text data, reference can be made to current technology in the field, and they are not described again here.
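A minimal sketch of the cleaning, word segmentation, and stop-word removal pipeline is given below; jieba is one common string-matching-based Chinese segmenter, and the regular expression and stop-word list are illustrative assumptions.

```python
# Sketch of text cleaning, word segmentation, and stop-word removal.
import re

import jieba  # one common string-matching-based segmenter

STOP_WORDS = {"的", "了", "是", "我"}  # placeholder stop-word list

def preprocess_review(text: str) -> list[str]:
    # Cleaning: strip characters irrelevant to classification.
    text = re.sub(r"[^\w\u4e00-\u9fff]+", " ", text)
    tokens = jieba.lcut(text)
    return [t for t in tokens if t.strip() and t not in STOP_WORDS]
```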
  • in step S2, the method specifically includes the following steps:
  • S21: Extract features of the real-time voice data to obtain voice features.
  • S22: Input the voice features into a speech recognition deep neural network model for transcription to obtain a first speech text vector.
  • S23: Input the first speech text vector into a speech recognition convolutional neural network for encoding to obtain a second speech text vector.
  • in step S21, extracting the real-time voice data features specifically includes obtaining their Mel-frequency cepstral coefficient (MFCC) features.
  • specifically, step S21 may include: the preprocessed real-time speech data is subjected to a fast Fourier transform to obtain the energy spectrum of each frame of the real-time speech signal; the energy spectrum is passed through a set of Mel-scale triangular filter banks to smooth the spectrum and eliminate the effect of harmonics, highlighting the formants of the speech; the MFCC features are then obtained through further logarithmic operations and a discrete cosine transform.
  • characteristic parameters such as Perceptual Linear Prediction (PLP) or Linear Predictive Coding (LPC) features of the real-time speech data can also be obtained through different algorithm steps.
  • PLP: Perceptual Linear Prediction.
  • LPC: Linear Predictive Coding.
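The MFCC pipeline above (FFT, Mel filter bank, logarithm, DCT) can be sketched with librosa, which performs those steps internally; the 16 kHz sampling rate and coefficient count are assumptions for illustration.

```python
# Sketch of MFCC extraction; librosa internally performs the STFT,
# Mel filtering, log compression, and discrete cosine transform.
import librosa

def extract_mfcc(wav_path: str, n_mfcc: int = 13):
    signal, sr = librosa.load(wav_path, sr=16000)
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, n_frames)
```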
  • in step S22, the text content of the real-time speech data is transcribed through the pre-trained speech recognition deep neural network to obtain the first speech text vector.
  • speech recognition is completed directly through a deep neural network model.
  • the deep neural network model avoids the assumption that acoustic features must obey an independent and identical distribution, and, unlike the Gaussian mixture hybrid model, its network input is formed by splicing and overlapping several adjacent frames, so it can better utilize context information, obtain more speech feature information, and achieve higher speech recognition accuracy.
  • the algorithm steps involved in step S21 can be incorporated into the deep neural network model to make the overall model structure more balanced.
  • after the first speech text vector is obtained, it is encoded through a speech recognition convolutional neural network. Since the convolutional neural network is translation-invariant in time and space, modeling the acoustic features with a CNN can smooth over the variability of the speech signal and complete the encoding; the resulting second speech text vector contains high-level semantic feature information of the real-time speech data.
  • the real-time speech data can also be transcribed into the speech text data by constructing neural network models of other structures or by using models such as Gaussian mixture models, as long as the real-time speech data can be transcribed into the speech text data.
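The convolutional encoding step can be sketched as a 1-D CNN over the first speech text vector sequence; the use of PyTorch and all dimensions here are illustrative assumptions rather than the patent's prescribed architecture.

```python
# Sketch of encoding the first speech text vector with a CNN to obtain
# the second speech text vector carrying higher-level features.
import torch
import torch.nn as nn

class SpeechTextEncoder(nn.Module):
    def __init__(self, embed_dim: int = 256, hidden_dim: int = 256):
        super().__init__()
        # The convolution's translation invariance smooths over local
        # variability in the speech-derived sequence.
        self.conv = nn.Conv1d(embed_dim, hidden_dim, kernel_size=3, padding=1)
        self.act = nn.ReLU()

    def forward(self, first_vec: torch.Tensor) -> torch.Tensor:
        # first_vec: (batch, seq_len, embed_dim)
        x = first_vec.transpose(1, 2)              # (batch, embed_dim, seq_len)
        return self.act(self.conv(x)).transpose(1, 2)
```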
  • the text transcription and feature extraction of the real-time voice data are thus completed through step S2.
  • in step S3, the method specifically includes:
  • S31: Convert the historical food material review text data into food material review word vectors.
  • S32: Input the food material review word vectors into a bidirectional long short-term memory network model to obtain a food material review context feature vector containing contextual feature information based on the historical food material review text data.
  • in step S31, in order to convert the text data into a vectorized form that a computer can recognize and process, the historical food material review text data can be converted into the food material review word vectors through the Word2Vec algorithm, or through other existing algorithms in the field such as GloVe; the present invention places no specific restriction on this.
  • the bidirectional long short-term memory (BiLSTM) network is composed of a forward long short-term memory (LSTM) network and a backward LSTM network.
  • the LSTM model can better capture long-distance dependencies in text semantics, and the BiLSTM model built on it can better capture the bidirectional semantics of text.
  • the food review word vectors are input into the BiLSTM model respectively; the hidden-layer states representing effective information at each time step are obtained, and the food review context feature vectors carrying contextual information are output.
  • a common recurrent network model in the field such as a Gated Recurrent Unit (GRU) network can also be used to extract contextual feature information, and the present invention does not impose specific limitations on this.
  • GRU: Gated Recurrent Unit.
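Steps S31-S32 can be sketched as an embedding lookup (standing in for Word2Vec-style word vectors) followed by a BiLSTM; the framework and dimensions are simplifying assumptions.

```python
# Sketch of steps S31-S32: word vectors fed through a BiLSTM that
# concatenates forward and backward hidden states at every time step.
import torch
import torch.nn as nn

class ReviewContextEncoder(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 128, hidden_dim: int = 128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)  # word2vec-style lookup
        self.bilstm = nn.LSTM(embed_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        vectors = self.embedding(token_ids)        # (batch, seq_len, embed_dim)
        context, _ = self.bilstm(vectors)          # (batch, seq_len, 2 * hidden_dim)
        return context
```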
  • the following step may also be added to step S3:
  • S33: Input the second speech text vector into the speech recognition bidirectional long short-term memory network model to obtain a speech text context feature vector containing contextual feature information based on the speech text data.
  • through steps S2 and S3, feature extraction of the voice text data and the historical ingredient review text data is completed separately, different semantic feature information is obtained, and useful text information is extracted, which improves the accuracy of item classification, avoids the loss or filtering of useful information, and improves the performance of the model.
  • after step S3, there is also the step:
  • S3a Based on the attention mechanism model, enhance the text features of the speech text data and the historical ingredient review text data.
  • specifically, step S3a includes:
  • inputting the second speech text vector and the food material review context feature vector into a self-attention mechanism model and a mutual attention mechanism model respectively;
  • obtaining a voice text attention feature vector that includes the weight information of the voice text data itself and the weight information between the voice text data and the historical ingredient review text data;
  • obtaining a food review text attention feature vector that includes the weight information of the historical food review text data itself and the weight information between the historical food review text context feature vector and the voice text data.
  • the attention mechanism can guide the neural network to focus on more critical information and suppress non-critical information; therefore, by introducing the attention mechanism, local key features or weight information of the output text data can be obtained, further reducing irregular error alignment of time series during model training.
  • the input second voice text vector and the food review context feature vector are each given their own weight information through the self-attention mechanism model, thereby obtaining the internal weight information of the text semantic features of the voice text data and the historical food review text data. The mutual attention mechanism model further assigns the second voice text vector and the food review context feature vector their mutual correlation weights, thereby obtaining the association weight information between the voice text data and the historical food review text data.
  • the finally obtained speech text attention feature vector and food review text attention feature vector enhance the importance of different parts of the text semantic feature information, further improving the interpretability of the model.
  • the text feature enhancement of the speech text data and the historical ingredient review text data can also be completed based only on the self-attention mechanism model, or through other algorithm models.
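One way to realize the combined self-attention and mutual-attention enhancement is sketched below with torch.nn.MultiheadAttention; sharing the attention modules across the two sequences and all dimensions are simplifying assumptions, not the patent's prescribed design.

```python
# Sketch of step S3a: self-attention weights each sequence internally;
# mutual (cross) attention weights each sequence against the other.
import torch
import torch.nn as nn

class DualAttention(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, speech: torch.Tensor, review: torch.Tensor):
        # Internal weight information of each sequence itself.
        speech_self, _ = self.self_attn(speech, speech, speech)
        review_self, _ = self.self_attn(review, review, review)
        # Association weight information between the two sequences.
        speech_attn, _ = self.cross_attn(speech_self, review_self, review_self)
        review_attn, _ = self.cross_attn(review_self, speech_self, speech_self)
        return speech_attn, review_attn
```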
  • steps S2, S3, and S3a may also include:
  • obtain the configuration data stored in the external cache, perform deep neural network calculations on the voice text data and the historical ingredient review text data based on that configuration data, and carry out text transcription and extraction of the text features of the voice text data and the historical ingredient review text data.
  • configuring an external cache improves the calculation efficiency of the algorithm and effectively solves the time-response and space-complexity problems caused by the large amount of historical food review text data.
  • the order of the layers of the deep neural network can be adjusted or some layers can be omitted as needed, as long as the text classification of the voice text data and the historical food review text data can be completed.
  • the invention places no specific limitations on this.
  • in step S4, the method specifically includes:
  • the voice text attention feature vector and the food review text attention feature vector are jointly mapped into a unified multi-modal vector space for joint representation to obtain the joint feature vector. The multi-modal joint feature vector integrates the contextual information of the text semantics, useful feature information, high-level features, and the differing importance of useful features; it carries rich semantic feature information and achieves excellent text and speech representation capability.
  • step S4 may also be:
  • the speech text attention feature vector and the food review text attention feature vector are fused to obtain a fusion feature vector.
  • Multi-modal joint feature representation and multi-modal fusion are intended to combine the real-time voice data and the historical food review text to better extract and represent the feature information of both.
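The joint mapping into a unified multi-modal vector space can be sketched as two linear projections into a shared space followed by concatenation; the mean pooling over time and all dimensions are illustrative choices, not the patent's prescribed operations.

```python
# Sketch of step S4: project both attention feature vectors into one
# shared vector space and combine them into the joint feature vector.
import torch
import torch.nn as nn

class JointRepresentation(nn.Module):
    def __init__(self, speech_dim: int = 256, review_dim: int = 256,
                 joint_dim: int = 256):
        super().__init__()
        self.speech_proj = nn.Linear(speech_dim, joint_dim)
        self.review_proj = nn.Linear(review_dim, joint_dim)

    def forward(self, speech_attn: torch.Tensor, review_attn: torch.Tensor):
        # Mean-pool over time, map both modalities into the same space,
        # then concatenate to form the joint feature vector.
        s = self.speech_proj(speech_attn.mean(dim=1))
        r = self.review_proj(review_attn.mean(dim=1))
        return torch.cat([s, r], dim=-1)           # (batch, 2 * joint_dim)
```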
  • in step S5, the method specifically includes:
  • after the joint feature vector is combined through the fully connected layer, it is output to the Softmax function, which calculates the semantic scores of the speech text data and the historical food review text data and their normalized score results, obtaining the classification result information.
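A sketch of the fully connected combination plus Softmax scoring follows, consistent with the joint vector produced in the previous sketch; the layer sizes and the ten-category output are assumptions for illustration.

```python
# Sketch of step S5: a fully connected combination of the joint features
# followed by Softmax normalization of the category scores.
import torch
import torch.nn as nn

classifier = nn.Sequential(
    nn.Linear(512, 128),   # 512 matches 2 * joint_dim from the sketch above
    nn.ReLU(),
    nn.Linear(128, 10),    # 10 hypothetical item categories
)

def classify(joint_vec: torch.Tensor) -> torch.Tensor:
    scores = classifier(joint_vec)                 # raw semantic scores
    return torch.softmax(scores, dim=-1)           # normalized score results
```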
  • through the above steps, the method provided by the present invention sequentially completes the recognition and classification of the acquired voice data; by obtaining historical food material review text data and using it as part of the data set for pre-training and prediction models, text semantic feature information is obtained more comprehensively.
  • the historical ingredient review text data serves as supplementary data that compensates for the limited semantic information in the voice text, effectively improving text classification accuracy and thereby the accuracy of item classification.
  • building a network model that fuses a deep neural network with a convolutional neural network improves the accuracy of real-time speech recognition; building a neural network model that combines a context information mechanism, a self-attention mechanism, and a mutual attention mechanism extracts text semantic feature information more fully.
  • the overall model structure has excellent deep learning representation capability and high accuracy in classifying speech text, greatly improving the accuracy and generalization ability of item category classification.
  • in step S6, the item category information can be output by converting it into voice, and/or converting it into voice and transmitting it to a client terminal, and/or converting it into text, and/or converting it into text and transmitting it to a client terminal.
  • in this embodiment, after the classification result information is obtained through the previous steps and the item category information is determined, the item category information can be converted into voice and broadcast through the sound playback device built into the smart refrigerator, allowing direct voice interaction with the user; or it can be converted into text and displayed directly through the display device configured on the smart refrigerator. Moreover, the voice or text of the item category information can also be transmitted to a client terminal for output.
  • the client terminal is an electronic device with an information receiving function; for example, the voice can be transmitted to a mobile phone, smart speaker, Bluetooth headset, or other device for broadcast, or the classification result text can be transmitted through text messages, e-mails, and the like to client terminals such as mobile phones and tablets, or to application software installed on the client terminal, for the user to review.
  • a multi-channel and multi-type classification result information output method is realized.
  • the user is not limited to only obtaining relevant information near the smart refrigerator.
  • combined with the multi-channel, multi-type real-time voice acquisition methods provided by the present invention, the user can interact with the smart refrigerator remotely and directly obtain relevant information, which is extremely convenient and greatly improves the user experience.
  • one or more of the above classification result information output methods may be used, or the classification result information may be output through other channels based on existing technology; the present invention places no specific restriction on this.
  • in summary, the present invention provides an item classification method based on deep learning: real-time voice data containing item information is obtained through multiple channels; after the real-time voice data is transcribed into text, it is combined with historical ingredient review text data, and text semantic features are fully extracted through the deep neural network model; item category information is then obtained and output through multiple channels. The method significantly improves the accuracy of speech recognition and item category judgment while making interaction more convenient and diverse, greatly improving the user experience.
  • the present invention also provides an item classification device 7 based on deep learning, which includes:
  • a data acquisition module 71, used to acquire real-time voice data and historical text data;
  • a transcription module 72, used to transcribe the real-time voice data into voice text data;
  • a feature extraction module 73, used to extract text features of the voice text data and text features of the historical text data;
  • a joint representation module 74, used to jointly represent the text features of the real-time speech data and the text features of the historical text data to obtain joint features;
  • a result calculation module 75, used to combine the joint features through the fully connected layer, output them to the classifier to calculate scores and obtain the classification result information, and determine the item category information;
  • an output module 76, used to output the item category information.
  • the present invention also provides an electrical device, which includes:
  • a memory, used to store executable instructions;
  • a processor, configured to implement the above deep learning-based item classification method when running the executable instructions stored in the memory.
  • the present invention also provides a refrigerator, which includes:
  • a memory, used to store executable instructions;
  • a processor, configured to implement the above deep learning-based item classification method when running the executable instructions stored in the memory.
  • the present invention also provides a computer-readable storage medium storing executable instructions, characterized in that, when the executable instructions are executed by a processor, the above deep learning-based item classification method is implemented.


Abstract

The present invention provides an item classification method and apparatus based on deep learning. The method comprises the steps of: obtaining real-time speech data containing item information, and obtaining historical text data; transcribing the real-time speech data into speech text data, and extracting text features of the speech text data; extracting text features of the historical text data; jointly representing the real-time speech data text features and the historical text data text features to obtain joint features; combining the joint features via a fully connected layer, outputting them to a classifier, and calculating a score to obtain classification result information and determine item category information; and outputting the item category information.

Description

Item classification method and device based on deep learning
Technical field
The present invention relates to the field of computer technology, and in particular to an item classification method and device based on deep learning.
Background
With the maturing application of speech recognition technology, the following problems are common when applying it to food material content in refrigerator scenarios: the classification accuracy of food material content is low, and the importance information contained in food material reviews is neither combined nor extracted, resulting in a poor food-push experience or even poor pushed content. Therefore, how to use deep learning to build an intelligent voice-based ingredient classification model has become a key technology and solution for improving the refrigerator experience. Moreover, smart refrigerator interaction is inseparable from multi-source heterogeneous data such as voice, text, and images, so how to maximize the use and fusion of the most useful multi-modal data feature information, thereby optimizing the accuracy of intelligent voice ingredient classification and improving the refrigerator user experience, is a problem for which the industry has not yet proposed an effective solution.
Summary of the invention
The purpose of the present invention is to provide an item classification method and device based on deep learning.
The present invention provides an item classification method based on deep learning, including the steps of:
obtaining real-time voice data containing item information, and obtaining historical text data;
transcribing the real-time voice data into voice text data, and extracting text features of the voice text data;
extracting text features of the historical text data;
jointly representing the text features of the real-time voice data and the text features of the historical text data to obtain joint features;
combining the joint features through the fully connected layer, outputting them to the classifier to calculate scores and obtain classification result information, and determining the item category information;
outputting the item category information.
As a further improvement of the present invention, obtaining the historical text data specifically includes:
obtaining historical food material review text data as the historical text data.
As a further improvement of the present invention, transcribing the real-time voice data into voice text data and extracting text features of the voice text data specifically includes:
extracting features of the real-time voice data to obtain voice features;
inputting the voice features into a speech recognition deep neural network model for transcription to obtain a first speech text vector;
inputting the first speech text vector into a speech recognition convolutional neural network for encoding to obtain a second speech text vector.
As a further improvement of the present invention, extracting the features of the real-time voice data specifically includes:
extracting the features of the real-time voice data to obtain their Mel-frequency cepstral coefficient features.
As a further improvement of the present invention, extracting text features of the historical text data specifically includes:
converting the historical food material review text data into food material review word vectors;
inputting the food material review word vectors into a bidirectional long short-term memory network model to obtain a food material review context feature vector containing contextual feature information based on the historical food material review text data.
As a further improvement of the present invention, the method further includes the step of:
enhancing the text features of the voice text data and the historical food material review text data based on an attention mechanism model.
As a further improvement of the present invention, enhancing the text features of the voice text data and the historical food material review text data based on the attention mechanism model specifically includes:
inputting the second speech text vector and the food material review context feature vector into a self-attention mechanism model and a mutual attention mechanism model respectively;
obtaining a voice text attention feature vector that includes the weight information of the voice text data itself and the weight information between the voice text data and the historical food material review text data;
obtaining a food material review text attention feature vector that includes the weight information of the historical food material review text data itself and the weight information between the historical food material review text context feature vector and the voice text data.
As a further improvement of the present invention, jointly representing the text features of the real-time voice data and the text features of the historical text data to obtain a joint feature vector specifically includes:
jointly mapping the voice text attention feature vector and the food material review text attention feature vector into a unified multi-modal vector space for joint representation to obtain the joint feature vector.
As a further improvement of the present invention, combining the text features through the fully connected layer and outputting them to the classifier to calculate scores and obtain classification result information specifically includes:
combining the joint feature vector through the fully connected layer and outputting it to the Softmax function, and calculating the semantic scores of the voice text data and the historical food material review text data and their normalized score results to obtain the classification result information.
As a further improvement of the present invention, obtaining real-time voice data containing item information specifically includes:
obtaining the real-time voice data collected by a voice collection device, and/or
obtaining the real-time voice data transmitted from a client terminal.
As a further improvement of the present invention, obtaining historical food material review text data as the historical text data specifically includes:
obtaining internally stored historical food material review text as the historical food material review text data, and/or
obtaining externally stored historical food material review text as the historical food material review text data, and/or
obtaining historical food material review text transmitted by a client terminal as the historical food material review text data.
As a further improvement of the present invention, the method further includes the steps of:
preprocessing the real-time voice data, including framing and windowing the real-time voice data; and
preprocessing the historical text data, including cleaning, annotating, word-segmenting, and removing stop words from the speech text data.
As a further improvement of the present invention, outputting the item category information includes:
converting the item category information into voice for output, and/or
converting the item category information into voice and transmitting it to a client terminal for output, and/or
converting the item category information into text for output, and/or
converting the item category information into text and transmitting it to a client terminal for output.
As a further improvement of the present invention, transcribing the real-time voice data into voice text data, extracting text features of the voice text data, and extracting text features of the historical text data further includes:
obtaining configuration data stored in an external cache, performing deep neural network calculations on the real-time voice data and the historical food material review text data based on the configuration data, and carrying out text transcription and text feature extraction.
The present invention also provides an item classification device based on deep learning, including:
a data acquisition module, used to acquire real-time voice data and historical text data;
a transcription module, used to transcribe the real-time voice data into voice text data;
a feature extraction module, used to extract text features of the voice text data and text features of the historical text data;
a joint representation module, used to jointly represent the text features of the real-time voice data and the text features of the historical text data to obtain joint features;
a result calculation module, used to combine the joint features through the fully connected layer, output them to the classifier to calculate scores and obtain classification result information, and determine the item category information;
an output module, used to output the item category information.
The beneficial effects of the present invention are as follows: the method provided by the present invention completes the recognition and classification of the acquired voice data, and by obtaining historical food material review text data and using it as part of the data set for pre-training and prediction models, text semantic feature information is obtained more comprehensively. By comprehensively using the voice text data and the historical food material review text data, with the latter serving as supplementary data, the limited semantic information of the voice text is compensated for, effectively improving text classification accuracy and thereby the accuracy of item classification. Moreover, building a network model that fuses a deep neural network with a convolutional neural network improves the accuracy of real-time speech recognition, and building a neural network model that combines a context information mechanism, a self-attention mechanism, and a mutual attention mechanism extracts text semantic feature information more fully. The overall model structure has excellent deep learning representation capability, high speech recognition accuracy, and high accuracy in classifying speech text, greatly improving the accuracy and generalization ability of item category classification.
Description of the drawings
Figure 1 is a structural block diagram of the model involved in the item classification method based on deep learning in an embodiment of the present invention.
Figure 2 is a schematic diagram of the steps of the item classification method based on deep learning in an embodiment of the present invention.
Figure 3 is a schematic diagram of the steps of obtaining real-time voice data and obtaining historical text data in an embodiment of the present invention.
Figure 4 is a schematic diagram of the steps of transcribing the real-time voice data into voice text data and extracting text features of the voice text data in an embodiment of the present invention.
Figure 5 is a schematic diagram of the steps of extracting text features of the historical text data in an embodiment of the present invention.
Figure 6 is a schematic structural diagram of the item classification device based on deep learning in an embodiment of the present invention.
Detailed description
To make the purpose, technical solutions, and advantages of the present application clearer, the technical solutions of the present application are described clearly and completely below in conjunction with specific embodiments of the present application and the corresponding drawings. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the present application.
Embodiments of the present invention are described in detail below; examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they are only used to explain the present invention and are not to be construed as limiting it.
As shown in Figure 1, a structural block diagram of the model involved in the item classification method based on deep learning provided by the present invention, and as shown in Figure 2, a schematic diagram of the steps of the method, the method includes:
S1: Obtain real-time voice data containing item information, and obtain historical text data.
S2: Transcribe the real-time voice data into voice text data, and extract text features of the voice text data.
S3: Extract text features of the historical text data.
S4: Jointly represent the text features of the real-time voice data and the text features of the historical text data to obtain joint features.
S5: Combine the joint features through the fully connected layer, then output them to the classifier to calculate scores and obtain classification result information.
S6: Output the item category information.
The method provided by the present invention enables an intelligent electronic device to implement functions such as real-time interaction or message push with the user based on the user's real-time voice input. Illustratively, in this embodiment, a smart refrigerator is taken as an example, and the method is explained in combination with a pre-trained deep learning model. Based on the user's voice input, the smart refrigerator classifies the text content corresponding to the user's voice, judges the category of items involved in the voice, and pushes relevant classification information based on the item classification results. Further, in this embodiment, the classification of food materials in a smart refrigerator is taken as an example; in other embodiments, the method provided by the present invention can also be applied to classifying other items that need to be stored in the refrigerator, such as medicines and cosmetics.
As shown in Figure 3, step S1 specifically includes:
S11: Obtain the real-time voice data collected by a voice collection device, and/or
obtain the real-time voice data transmitted from a client terminal.
S12: Obtain internally stored historical food material review text as the historical food material review text data, and/or
obtain externally stored historical food material review text as the historical food material review text data, and/or
obtain historical food material review text transmitted by a client terminal as the historical food material review text data.
The real-time voice mentioned here refers to inquiry or command sentences currently spoken by the user to the intelligent electronic device or to a client terminal device communicatively connected to it. In this embodiment, the real-time voice is a sentence containing relevant information such as the category of items stored in the smart refrigerator; the user can ask questions such as "What vegetables are in the refrigerator today" or "What beef ingredients are in the refrigerator today", or issue commands such as "remind me of the types of beverages left in the refrigerator". Based on this information, the processor of the smart refrigerator judges the relevant item categories through the method provided by the present invention, and then interacts with the user by real-time voice or pushes relevant information.
具体的,在本实施方式中,获取历史文本数据包括:Specifically, in this implementation, obtaining historical text data includes:
获取历史食材评论文本数据作为所述历史文本数据。Obtain historical food ingredient review text data as the historical text data.
这里所述的历史食材评论文本数据指的是以往使用过程中用户对食材进行的评论所转写的文本，如“今天放进去的辣椒很辣”“昨天放入的某种品牌的酸奶很好喝”等，进一步的，其还可包括用户直接自行输入的相关食材评论文本数据等。所述历史食材评论文本通常会包含用户感兴趣的物品信息，选择其作为所述历史文本数据，能够对物品类别等信息做出有效补充。The historical food review text data mentioned here refers to the transcribed text of comments the user made on food materials during past use, such as "the chili I put in today is very spicy" or "the yogurt of a certain brand I put in yesterday tastes great"; further, it may also include relevant food review text data directly input by the user. The historical food review text usually contains information on items the user is interested in, so selecting it as the historical text data can effectively supplement information such as item categories.
在本发明的其他实施方式中,也可获取诸如以往用户提问或发出指令后,相关问题和指令所转写成的文本、或以往使用过程中用户依据放入的物品发出的说明性语音所转写的文本等其他历史文本数据,具体在此不再赘述。In other embodiments of the present invention, it is also possible to obtain texts such as the transcribed text of relevant questions and instructions after the user asked questions or issued instructions in the past, or the explanatory voice that the user issued based on the items he put in during the previous use. The text and other historical text data will not be described in detail here.
如步骤S11所述，在本实施方式中，可通过设置于智能冰箱内的拾音器、麦克风阵列等语音采集装置采集用户实时语音，在使用过程中，当用户需要与智能冰箱进行交互时，直接对智能冰箱发出语音即可。并且，也可通过与智能冰箱基于无线通信协议连接的客户终端获取传输而来的用户实时语音，客户终端为具有信息发送功能的电子设备，如手机、平板电脑、智能音响、智能手环或蓝牙耳机等智能电子设备，在使用过程中，用户直接对客户终端发出语音，客户终端采集语音后通过wifi或蓝牙等无线通信方式传输至智能冰箱。从而实现多渠道的实时语音获取方式，并不局限于必须面向智能冰箱发出语音。当用户有交互需求时，通过任意便捷渠道发出实时语音即可，从而能够显著提高用户的使用便捷度。在本发明的其他实施方式中，也可采用上述实时语音获取方法中一种或任意多种，或者也可基于现有技术通过其他渠道获取所述实时语音，本发明对此不作具体限制。As described in step S11, in this embodiment, the user's real-time voice can be collected through voice collection devices such as pickups and microphone arrays installed in the smart refrigerator; during use, when the user needs to interact with the smart refrigerator, the user simply speaks to the smart refrigerator directly. In addition, the user's real-time voice can also be obtained through a client terminal connected to the smart refrigerator via a wireless communication protocol. The client terminal is an electronic device with an information sending function, such as a mobile phone, tablet computer, smart speaker, smart bracelet or Bluetooth headset; during use, the user speaks directly to the client terminal, which collects the voice and transmits it to the smart refrigerator through wireless communication such as Wi-Fi or Bluetooth. This enables multi-channel real-time voice acquisition, not limited to speaking toward the smart refrigerator: when the user needs to interact, real-time voice can be sent through any convenient channel, which significantly improves user convenience. In other embodiments of the present invention, one or more of the above real-time voice acquisition methods may be used, or the real-time voice may be acquired through other channels based on the existing technology; the present invention does not impose specific limitations on this.
如步骤S12所述，在本实施方式中，可通过读取智能冰箱的内部存储器所存储的历史食材评论文本来获取所述历史食材评论文本数据。并且，也可通过读取智能冰箱配置的外部存储装置所存储的历史食材评论文本来获取所述历史食材评论文本数据，外部存储装置为诸如U盘、SD卡等移动存储设备，通过设置外部存储装置可进一步拓展智能冰箱的存储空间。并且，也可通过获取存储在诸如手机、平板电脑等客户终端或应用软件服务器端等处的所述历史食材评论文本数据。实现多渠道的历史文本获取渠道，能够大幅提高历史文本信息的数据量，从而提高后续语音识别的准确度。在本发明的其他实施方式中，也可采用上述历史食材评论文本数据获取方法中的一种或任意多种，或者也可基于现有技术通过其他渠道获取所述历史食材评论文本数据，本发明对此不作具体限制。As described in step S12, in this embodiment, the historical food review text data can be obtained by reading the historical food review texts stored in the internal memory of the smart refrigerator. The historical food review text data can also be obtained by reading the historical food review texts stored in an external storage device configured for the smart refrigerator; the external storage device is a removable storage device such as a USB flash drive or SD card, and configuring an external storage device can further expand the storage space of the smart refrigerator. In addition, the historical food review text data stored on a client terminal such as a mobile phone or tablet computer, or on an application software server, can also be obtained. Implementing multi-channel historical text acquisition can greatly increase the data volume of historical text information, thereby improving the accuracy of subsequent speech recognition. In other embodiments of the present invention, one or more of the above methods for obtaining historical food review text data may be used, or the historical food review text data may be obtained through other channels based on the existing technology; the present invention does not impose specific limitations on this.
进一步的，在本实施方式中，智能冰箱配置有外部缓存，至少有部分所述历史食材评论文本数据被储存在所述外部缓存中，随着使用时间增加，历史食材评论文本数据增多，通过将部分数据存储在外部缓存中，能够节省智能冰箱内部存储空间，并且在进行神经网络计算时，直接读取存储于外部缓存中的所述历史食材评论文本数据，能够提高算法效率。Further, in this embodiment, the smart refrigerator is configured with an external cache, and at least part of the historical food review text data is stored in the external cache. As the time of use increases, the historical food review text data grows; storing part of the data in the external cache saves internal storage space of the smart refrigerator, and when performing neural network computations, directly reading the historical food review text data stored in the external cache can improve algorithm efficiency.
具体的，在本实施方式中，采用Redis组件作为所述外部缓存，Redis组件为当前一种使用较为广泛的key/value存储结构的分布式缓存系统，其可用作数据库、高速缓存和消息队列代理。在本发明的其他实施方式中也可采用诸如Memcached等其他外部缓存，本发明对此不作具体限制。Specifically, in this embodiment, a Redis component is used as the external cache. Redis is a currently widely used distributed caching system with a key/value storage structure, which can serve as a database, cache and message-queue broker. Other external caches such as Memcached may also be used in other embodiments of the present invention, and the present invention places no specific limitations on this.
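Illustratively, a minimal sketch of caching and reading back historical review texts through Redis is given below. The connection parameters, the key name history:reviews and the sample reviews are illustrative assumptions only, not details prescribed by the present embodiment.

```python
import redis

# Connect to a local Redis instance used as the external cache
# (host/port/db are assumed defaults, not values from the embodiment).
r = redis.Redis(host="localhost", port=6379, db=0, decode_responses=True)

def cache_reviews(reviews):
    # Append review texts to a Redis list so that later neural-network
    # computations can read them without touching internal storage.
    if reviews:
        r.rpush("history:reviews", *reviews)

def load_reviews():
    # Read the whole cached list back for preprocessing.
    return r.lrange("history:reviews", 0, -1)

cache_reviews(["今天放进去的辣椒很辣", "昨天放入的酸奶很好喝"])
print(load_reviews())
```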
综上所述，在步骤S11和步骤S12中，能够通过多渠道灵活获取包含物品信息的实时语音数据和历史食材评论文本数据，在提升了用户体验的同时，保证了数据量，并有效提升了算法效率。To sum up, in steps S11 and S12, real-time voice data containing item information and historical food review text data can be flexibly obtained through multiple channels, which improves the user experience while ensuring the amount of data and effectively improving algorithm efficiency.
进一步的,步骤S1还包括对数据进行预处理的步骤,其包括:Further, step S1 also includes the step of preprocessing the data, which includes:
S13:对所述实时语音数据进行预处理,包括:对所述实时语音数据进行分帧处理和加窗处理。S13: Preprocess the real-time voice data, including: performing frame processing and windowing processing on the real-time voice data.
S14:对所述历史文本数据进行预处理，包括：对所述历史文本数据进行清洗处理、标注、分词、去停用词。S14: Preprocess the historical text data, including: cleaning, annotating, word segmentation, and stop-word removal on the historical text data.
具体的，在步骤S13中，将语音根据指定的长度（时间段或者采样数）进行分段，结构化为可编程的数据结构，完成对语音的分帧处理得到语音信号数据。接着，将语音信号数据与一个窗函数相乘，使原本没有周期性的语音信号呈现出周期函数的部分特征，完成加窗处理。进一步的，还可在分帧处理之前进行预加重处理，对语音的高频部分进行加重，以消除发声过程中口唇辐射的影响，从而补偿语音信号受到发音系统所压抑的高频部分，并能突显高频的共振峰。并且，在加窗处理之后还可进行过滤音频噪音点处理和增强人声处理等步骤，从而完成对所述实时语音数据的加强，提取得到所述实时语音的特征参数，使所述实时语音数据符合后续神经网络模型的输入要求。Specifically, in step S13, the speech is segmented according to a specified length (time period or number of samples) and structured into a programmable data structure, completing the framing of the speech to obtain speech signal data. Then, the speech signal data is multiplied by a window function, so that the originally non-periodic speech signal exhibits some characteristics of a periodic function, completing the windowing process. Further, pre-emphasis processing can be performed before framing to emphasize the high-frequency part of the speech, eliminating the influence of lip radiation during voicing, thereby compensating for the high-frequency part of the speech signal suppressed by the articulation system and highlighting the high-frequency formants. In addition, after windowing, steps such as filtering audio noise points and enhancing the human voice can be performed, thereby completing the enhancement of the real-time voice data, extracting the characteristic parameters of the real-time voice, and making the real-time voice data meet the input requirements of the subsequent neural network models.
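Illustratively, a minimal sketch of the pre-emphasis, framing and windowing described above is given below, assuming a 16 kHz mono signal; the frame length (25 ms), hop (10 ms) and pre-emphasis coefficient are common illustrative values rather than parameters fixed by the present embodiment.

```python
import numpy as np

def preprocess_speech(signal, frame_len=400, hop=160, alpha=0.97):
    # Pre-emphasis: boost high frequencies suppressed by the articulation system.
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Framing: split into overlapping fixed-length segments.
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    frames = np.stack([emphasized[i * hop: i * hop + frame_len]
                       for i in range(n_frames)])
    # Windowing: multiply each frame by a Hamming window so the
    # non-periodic signal behaves locally like a periodic one.
    return frames * np.hamming(frame_len)

frames = preprocess_speech(np.random.randn(16000))  # one second at 16 kHz
print(frames.shape)  # (98, 400)
```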
具体的，在步骤S14中，删除历史食材评论文本数据集中的无关数据、重复数据以及处理异常值和缺失值数据等，初步筛选掉与分类无关的信息，对所述历史食材评论文本数据进行清洗处理。接着，基于规则统计的方法等对所述历史食材评论文本数据进行类别标签标注，以及基于字符串匹配的分词方法、基于理解的分词方法、基于统计的分词方法和基于规则的分词方法等对所述历史食材评论文本数据进行分词处理。之后，去除停用词，完成对所述历史食材评论文本数据的预处理，从而使所述历史食材评论文本数据符合后续神经网络模型的输入要求。Specifically, in step S14, irrelevant data and duplicate data in the historical food review text data set are deleted, and outliers and missing values are handled, initially screening out information irrelevant to classification and thus cleaning the historical food review text data. Next, the historical food review text data is annotated with category labels based on rule-statistics methods and the like, and is segmented into words using word segmentation methods based on string matching, understanding, statistics or rules. Afterwards, stop words are removed, completing the preprocessing of the historical food review text data so that it meets the input requirements of the subsequent neural network models.
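Illustratively, a minimal sketch of the cleaning, word segmentation and stop-word removal for one review sentence is given below, using the jieba segmenter as one string-matching/statistical segmentation option; the tiny stop-word list is an illustrative assumption, not the list used in the present embodiment.

```python
import jieba

STOPWORDS = {"的", "了", "很"}  # illustrative stop-word list

def preprocess_review(text):
    # Cleaning: normalize whitespace; dropping duplicates, outliers and
    # missing values is assumed to happen upstream on the whole data set.
    text = text.strip()
    # Word segmentation with jieba (string-matching + statistical).
    tokens = jieba.lcut(text)
    # Stop-word removal keeps only classification-relevant tokens.
    return [t for t in tokens if t not in STOPWORDS and t.strip()]

print(preprocess_review("今天放进去的辣椒很辣"))
```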
在步骤S13和步骤S14中，对所述实时语音数据和所述历史食材评论文本数据预处理所采用的具体算法可参考当前本领域现有技术，具体在此不再赘述。In steps S13 and S14, the specific algorithms used to preprocess the real-time voice data and the historical food review text data can be found in the existing technology in the field, and will not be described in detail here.
如图4所示,在步骤S2中,其具体包括步骤:As shown in Figure 4, in step S2, it specifically includes the following steps:
S21:提取所述实时语音数据特征,得到语音特征。S21: Extract the real-time voice data features to obtain voice features.
S22:将所述语音特征输入语音识别深度神经网络模型转写得到第一语音文本向量。S22: Enter the speech feature into the speech recognition deep neural network model and transcribe it to obtain the first speech text vector.
S23:将所述第一语音文本向量输入语音识别卷积神经网络进行编码,得到第二语音文本向量。S23: Input the first speech text vector into a speech recognition convolutional neural network for encoding to obtain a second speech text vector.
在步骤S21中,提取所述实时语音数据特征具体包括:In step S21, extracting the real-time voice data features specifically includes:
提取所述实时语音数据特征，获取其梅尔频率倒谱系数特征(Mel-scale Frequency Cepstral Coefficients，简称MFCC)。MFCC是一种语音信号中具有辨识性的成分，是在Mel标度频率域提取出来的倒谱参数，其中，Mel标度描述了人耳频率的非线性特性，MFCC的参数考虑到了人耳对不同频率的感受程度，特别适用于语音辨别和语者辨识。The characteristics of the real-time speech data are extracted to obtain their Mel-scale Frequency Cepstral Coefficients (MFCC). MFCC is a discriminative component of the speech signal, a cepstral parameter extracted in the Mel-scale frequency domain, where the Mel scale describes the nonlinear characteristics of human auditory frequency perception; MFCC parameters take into account the human ear's sensitivity to different frequencies and are especially suitable for speech recognition and speaker identification.
示例性的,步骤S21可包括:For example, step S21 may include:
将预处理后的所述实时语音数据经过快速傅里叶变换后得到各帧实时语音数据信号的能量谱，并将能量谱通过一组Mel尺度的三角形滤波器组来对频谱进行平滑化，消除谐波的作用，突显实时语音的共振峰，之后再进一步通过对数运算和离散余弦变换后得到MFCC系数特征。The preprocessed real-time speech data is subjected to a fast Fourier transform to obtain the energy spectrum of each frame of the real-time speech signal, and the energy spectrum is passed through a set of Mel-scale triangular filter banks to smooth the spectrum, eliminate the effect of harmonics and highlight the formants of the real-time speech; the MFCC coefficient features are then obtained through further logarithmic operations and a discrete cosine transform.
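Illustratively, the FFT, Mel filter bank, logarithm and DCT chain above is the standard MFCC pipeline, so a minimal sketch can delegate to librosa; the file name, sampling rate and frame parameters below are illustrative assumptions.

```python
import librosa

# Load one utterance; "utterance.wav" is a placeholder file name.
y, sr = librosa.load("utterance.wav", sr=16000)
# librosa applies the STFT, Mel-scale triangular filter bank, log and DCT
# internally and returns one MFCC vector per frame (25 ms window, 10 ms hop).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)
print(mfcc.shape)  # (13, n_frames)
```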
在本发明的其他实施方式中，也可通过不同算法步骤获取所述实时语音数据的感知线性预测特征(Perceptual Linear Predictive，简称PLP)或线性预测系数特征(Linear Predictive Coding，简称LPC)等特征参数来取代MFCC特征，具体可基于实际模型参数和本方法实际应用的领域而进行具体选择，本发明对此不做具体限制。In other embodiments of the present invention, characteristic parameters such as Perceptual Linear Predictive (PLP) features or Linear Predictive Coding (LPC) features of the real-time speech data can also be obtained through different algorithm steps to replace the MFCC features; the specific selection can be made based on the actual model parameters and the field in which this method is actually applied, and the present invention does not impose specific restrictions on this.
上述步骤中所涉及的具体的算法步骤可参考当前本领域现有技术,具体在此不再赘述。For the specific algorithm steps involved in the above steps, reference can be made to the current state of the art in the field, and details will not be described again here.
在步骤S22中,通过预先训练的所述语音识别深度神经网络完成对所述实时语音数据的文本内容转写,得到所述第一语音文本向量。In step S22, the text content of the real-time speech data is transcribed through the pre-trained speech recognition deep neural network to obtain the first speech text vector.
在本实施方式中，直接通过深度神经网络模型来完成语音识别，相比于现有技术中常用的高斯混合模型等模型，深度神经网络模型避免了声学特征需要服从独立同分布的假设，与高斯混合模型中的网络输入不同，深度神经网络模型的输入由相邻的若干帧拼接重叠得到，从而能够更好地利用上下文的信息，获取更多语音特征信息，具有更高的语音识别精度。In this embodiment, speech recognition is completed directly through a deep neural network model. Compared with models such as the Gaussian mixture model commonly used in the prior art, the deep neural network model avoids the assumption that acoustic features must obey an independent and identical distribution. Unlike the network input of a Gaussian mixture model, the input of the deep neural network model is obtained by splicing and overlapping several adjacent frames, so that it can better utilize context information, obtain more speech feature information, and achieve higher speech recognition accuracy.
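Illustratively, a minimal sketch of the frame splicing described above is given below: each frame is stacked with its neighbouring frames to form the network input; the context width of ±5 frames is an illustrative assumption.

```python
import numpy as np

def splice_frames(features, context=5):
    # Stack each frame with its +/- context neighbours so the DNN input
    # carries local acoustic context, padding at the edges by repetition.
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    return np.stack([padded[i:i + 2 * context + 1].reshape(-1)
                     for i in range(len(features))])

feats = np.random.randn(100, 13)   # 100 frames of 13-dim MFCCs
spliced = splice_frames(feats)     # each row is 11 * 13 = 143 dims
print(spliced.shape)               # (100, 143)
```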
进一步的,在本实施方式中,步骤S21所涉及算法步骤可以结合在所述深度神经网络模型中,以使整体模型结构更加均衡。Furthermore, in this embodiment, the algorithm steps involved in step S21 can be combined into the deep neural network model to make the overall model structure more balanced.
在得到所述第一语音文本向量后，通过语音识别卷积神经网络对其进行编码，由于卷积神经网络在时间和空间上具有平移不变性，所以基于CNN对语音识别的声学特征进行建模，能够消除语音信号的多样性，完成对其的编码工作，最终得到的所述第二语音文本向量包含实时语音数据的高层特征语义信息。After the first speech text vector is obtained, it is encoded by the speech recognition convolutional neural network. Since the convolutional neural network has translation invariance in time and space, modeling the acoustic features of speech recognition based on the CNN can eliminate the diversity of the speech signal and complete its encoding; the second speech text vector finally obtained contains high-level semantic feature information of the real-time speech data.
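Illustratively, a minimal PyTorch sketch of a convolutional encoder over the first speech text vectors is given below; the layer sizes and depth are illustrative assumptions and do not reproduce the exact network of the present embodiment.

```python
import torch
import torch.nn as nn

class SpeechTextEncoder(nn.Module):
    # Convolutional encoding of the first speech text vector sequence;
    # dimensions are illustrative, not taken from the patent.
    def __init__(self, dim=256, hidden=256, kernel=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(dim, hidden, kernel, padding=kernel // 2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2), nn.ReLU(),
        )

    def forward(self, x):                 # x: (batch, seq_len, dim)
        h = self.conv(x.transpose(1, 2))  # Conv1d expects (batch, dim, seq_len)
        return h.transpose(1, 2)          # second speech-text vector sequence

enc = SpeechTextEncoder()
print(enc(torch.randn(2, 50, 256)).shape)  # (2, 50, 256)
```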
在本发明的其他实施方式中，也可通过构建其他结构神经网络模型或者通过高斯混合模型等模型来将所述实时语音数据转写为所述语音文本数据，只要能够将所述实时语音数据转写为所述语音文本数据即可。In other embodiments of the present invention, the real-time speech data can also be transcribed into the speech text data by constructing neural network models of other structures or by using models such as Gaussian mixture models, as long as the real-time speech data can be transcribed into the speech text data.
综上所述,通过步骤S2完成了对所述实时语音数据的文本转写及特征提取。To sum up, the text transcription and feature extraction of the real-time voice data are completed through step S2.
如图5所示,在步骤S3中,其具体包括:As shown in Figure 5, in step S3, it specifically includes:
S31:将所述历史食材评论文本数据转化为食材评论词向量。S31: Convert the historical food material review text data into food material review word vectors.
S32:将所述食材评论词向量输入双向长短记忆网络模型,获取包含基于所述历史食材评论文本数据上下文特征信息的食材评论上下文特征向量。S32: Input the food material review word vector into a two-way long and short memory network model to obtain an food material review context feature vector containing contextual feature information based on the historical food material review text data.
在步骤S31中，为了将文本数据转化为计算机能够识别和处理的向量化形式，可通过Word2Vec算法，将所述历史食材评论文本数据转化为所述食材评论词向量，或者也可通过其他诸如Glove算法等本领域现有算法转化得到所述词向量，本发明对此不做具体限制。In step S31, in order to convert the text data into a vectorized form that a computer can recognize and process, the historical food review text data can be converted into the food review word vectors through the Word2Vec algorithm, or the word vectors can be obtained through other existing algorithms in the field such as the GloVe algorithm; the present invention does not impose specific restrictions on this.
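Illustratively, a minimal sketch of training review word vectors with the gensim implementation of Word2Vec is given below; the toy corpus and the hyperparameters (vector_size, window, min_count) are illustrative assumptions.

```python
from gensim.models import Word2Vec

# Tokenized historical review sentences (output of the segmentation step).
sentences = [["辣椒", "很辣"], ["酸奶", "好喝"]]
# Train skip-gram Word2Vec; hyperparameters are illustrative only.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)
vec = model.wv["辣椒"]  # 100-dimensional food review word vector
print(vec.shape)
```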
在步骤S32中，双向长短记忆网络(Bi-directional Long Short-Term Memory，简写BiLSTM)由前向长短记忆网络(Long Short-Term Memory，简写LSTM)和后向长短记忆网络组合而成，LSTM模型能够更好地获取文本语义长距离的依赖关系，而在其基础上，BiLSTM模型能更好地获取文本双向语义。将多个所述食材评论词向量分别输入BiLSTM模型中，经过前向LSTM和后向LSTM后，得到每个时间步输出的表示有效信息的隐藏层状态，输出带有语境上下文信息的所述食材评论上下文特征向量。In step S32, the Bi-directional Long Short-Term Memory network (BiLSTM) is composed of a forward Long Short-Term Memory network (LSTM) and a backward LSTM. The LSTM model can better capture long-distance dependencies in text semantics, and on this basis the BiLSTM model can better capture bidirectional text semantics. The multiple food review word vectors are respectively input into the BiLSTM model; after the forward LSTM and the backward LSTM, the hidden-layer states representing effective information at each time step are obtained, and the food review context feature vectors carrying contextual information are output.
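Illustratively, a minimal PyTorch sketch of the BiLSTM context encoder is given below; the embedding and hidden sizes are illustrative assumptions. The per-time-step outputs concatenate the forward and backward hidden states, corresponding to the context feature vectors described above.

```python
import torch
import torch.nn as nn

class ReviewContextEncoder(nn.Module):
    # Bidirectional LSTM over review word vectors; the concatenated forward
    # and backward hidden states carry the context feature information.
    def __init__(self, emb_dim=100, hidden=128):
        super().__init__()
        self.bilstm = nn.LSTM(emb_dim, hidden,
                              batch_first=True, bidirectional=True)

    def forward(self, word_vectors):       # (batch, seq_len, emb_dim)
        outputs, _ = self.bilstm(word_vectors)
        return outputs                     # (batch, seq_len, 2 * hidden)

enc = ReviewContextEncoder()
print(enc(torch.randn(2, 20, 100)).shape)  # (2, 20, 256)
```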
在本发明的其他实施方式中,也可采用诸如门控循环单元(Gated Recurrent Unit,简写GRU)网络等本领域常见的循环网络模型来提取上下文特征信息,本发明对此不作具体限制。In other embodiments of the present invention, a common recurrent network model in the field such as a Gated Recurrent Unit (GRU) network can also be used to extract contextual feature information, and the present invention does not impose specific limitations on this.
在本发明的另一些实施方式中,也可在步骤S3中增加步骤:In other embodiments of the present invention, steps may also be added to step S3:
S33:将所述第二语音文本向量输入语音识别双向长短记忆网络模型,获取包含基于所述语音文本数据上下文特征信息的语音文本上下文特征向量。S33: Input the second speech text vector into the speech recognition bidirectional long and short memory network model, and obtain a speech text context feature vector containing context feature information based on the speech text data.
从而进一步增加所述语音文本数据的上下文特征信息，但基于整体模型结构考虑，在本实施方式中，不增加语音识别双向长短记忆网络模型，从而使得整体模型结构更加对称和均衡，本领域技术人员可依据实际模型结构对模型层数进行具体调整，本发明对此不做具体限制。This further increases the contextual feature information of the speech text data. However, in consideration of the overall model structure, in this embodiment the speech recognition bidirectional long short-term memory network model is not added, so that the overall model structure is more symmetrical and balanced. Those skilled in the art can make specific adjustments to the number of model layers based on the actual model structure, and the present invention does not impose specific restrictions on this.
从而，通过步骤S2和S3分别完成了对所述语音文本数据和所述历史食材评论文本数据的特征提取，分别得到了不同的语义特征信息并进而提取了有用的文本信息，提升了物品分类的准确性，避免有用信息的丢失或过滤，提升了模型的性能。Thus, through steps S2 and S3, feature extraction of the speech text data and the historical food review text data is completed respectively, different semantic feature information is obtained, and useful text information is extracted, which improves the accuracy of item classification, avoids the loss or filtering of useful information, and improves the performance of the model.
进一步的,在本发明一些实施方式中,在步骤S3之后,还包括步骤:Further, in some embodiments of the present invention, after step S3, there are also steps:
S3a:基于注意力机制模型,增强所述语音文本数据和所述历史食材评论文本数据的文本特征。S3a: Based on the attention mechanism model, enhance the text features of the speech text data and the historical ingredient review text data.
具体的,步骤S3a包括:Specifically, step S3a includes:
分别将所述第二语音文本向量和所述食材评论上下文特征向量输入自注意力机制模型和互注意力机制模型；Input the second speech text vector and the food review context feature vector into the self-attention mechanism model and the mutual-attention mechanism model respectively;
获取包含所述语音文本数据自身权重信息以及所述语音文本数据与所述历史食材评论文本数据之间权重信息的语音文本注意力特征向量;Obtain a voice text attention feature vector that includes the weight information of the voice text data itself and the weight information between the voice text data and the historical ingredient review text data;
获取包含所述历史食材评论文本数据自身权重信息以及所述历史食材评论文本数据上下文特征向量与所述语音文本数据之间权重信息的食材评论文本注意力特征向量。Obtain a food review text attention feature vector that includes the weight information of the historical food review text data itself and the weight information between the context feature vector of the historical food review text data and the voice text data.
注意力机制可以引导神经网络去关注更为关键的信息而抑制其他非关键的信息，因此，通过引入注意力机制，能够得到所述输出文本数据的局部关键特征或权重信息，从而进一步减少模型训练时序列的不规则误差对齐现象。The attention mechanism can guide the neural network to focus on more critical information and suppress other non-critical information. Therefore, by introducing the attention mechanism, the local key features or weight information of the output text data can be obtained, thereby further reducing the irregular error alignment of sequences during model training.
这里，通过自注意力机制模型将输入的所述第二语音文本向量和所述食材评论上下文特征向量赋予其自身权重信息，从而获得所述语音文本数据和所述历史食材评论文本数据文本语义特征的内部权重信息。并进一步通过互注意力机制模型将输入的所述第二语音文本向量和所述食材评论上下文特征向量赋予其相互之间的关联权重信息，从而获得所述语音文本数据和所述历史食材评论文本数据之间的关联权重信息。最终得到的所述语音文本注意力特征向量和所述食材评论文本注意力特征向量增强了文本语义特征信息不同部分的重要性，使得模型的可解释性进一步优化。Here, the input second speech text vector and the food review context feature vector are each given their own weight information through the self-attention mechanism model, thereby obtaining the internal weight information of the text semantic features of the speech text data and the historical food review text data. Further, through the mutual-attention mechanism model, the input second speech text vector and the food review context feature vector are given mutual association weight information, thereby obtaining the association weight information between the speech text data and the historical food review text data. The speech text attention feature vector and the food review text attention feature vector finally obtained enhance the importance of different parts of the text semantic feature information, further improving the interpretability of the model.
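Illustratively, a minimal PyTorch sketch of the self-attention and mutual (cross) attention computation is given below, using nn.MultiheadAttention for both; the dimensions and head count are illustrative assumptions, and sharing one cross-attention module for both directions is a simplification of this sketch rather than a detail of the present embodiment.

```python
import torch
import torch.nn as nn

dim = 256
self_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

speech = torch.randn(2, 50, dim)  # second speech text vectors
review = torch.randn(2, 20, dim)  # food review context feature vectors

# Self-attention: each modality weights its own tokens (internal weights).
speech_self, _ = self_attn(speech, speech, speech)
# Mutual attention: speech queries attend over review keys/values and vice
# versa, yielding the inter-modality association weight information.
speech_attn, _ = cross_attn(speech_self, review, review)
review_attn, _ = cross_attn(review, speech, speech)
print(speech_attn.shape, review_attn.shape)
```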
在本发明的其他实施方式中,也可仅基于自注意力机制模型,或通过其他算法模型完成对所述语音文本数据和所述历史食材评论文本数据的文本特征增强。In other embodiments of the present invention, the text feature enhancement of the speech text data and the historical ingredient review text data can also be completed based only on the self-attention mechanism model, or through other algorithm models.
进一步的,在本发明的一些实施方式中,步骤S2、S3、S3a还可包括:Further, in some embodiments of the present invention, steps S2, S3, and S3a may also include:
获取存储于外部缓存的配置数据，将所述语音文本数据和所述历史食材评论文本数据基于所述配置数据执行深度神经网络计算，进行文本转写和提取所述语音文本数据和所述历史食材评论文本数据的文本特征。Obtain the configuration data stored in the external cache, perform deep neural network computations on the speech text data and the historical food review text data based on the configuration data, and carry out text transcription and extraction of the text features of the speech text data and the historical food review text data.
这里，通过配置外部缓存提高了算法计算效率，有效解决了所述历史食材评论文本数据量较大带来的时间响应和空间计算复杂度等问题。Here, configuring the external cache improves the computational efficiency of the algorithm and effectively solves problems such as time response and spatial computational complexity caused by the large amount of historical food review text data.
在本发明的其他实施方式中,可以根据需要调整深度神经网络各层的排列顺序或省略部分层,只要能够完成对所述语音文本数据和所述历史食材评论文本数据的文本分类即可,本发明对此不作具体限制。In other embodiments of the present invention, the order of the layers of the deep neural network can be adjusted or some layers can be omitted as needed, as long as the text classification of the voice text data and the historical food review text data can be completed. The invention places no specific limitations on this.
在步骤S4中,其具体包括:In step S4, it specifically includes:
将所述语音文本注意力特征向量和所述食材评论文本注意力特征向量共同映射到一个统一多模态向量空间进行联合表示得到所述联合特征向量，多模态联合的所述联合特征向量融合了文本语义的上下文信息、特征有用信息、高层特征、有用特征的不同重要性等最优表征能力，具有丰富的语义特征信息，从而能够获得优秀的文本、语音表征能力。The speech text attention feature vector and the food review text attention feature vector are jointly mapped into a unified multi-modal vector space for joint representation to obtain the joint feature vector. The multi-modal joint feature vector fuses optimal representation capabilities such as the contextual information of text semantics, useful feature information, high-level features and the different importance of useful features; it carries rich semantic feature information and therefore achieves excellent text and speech representation capability.
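Illustratively, a minimal sketch of the joint representation is given below: both attention feature vectors are linearly projected into one shared vector space, pooled, and concatenated into the joint feature; the projection sizes and the mean pooling are illustrative assumptions.

```python
import torch
import torch.nn as nn

class JointRepresentation(nn.Module):
    # Project both attention feature vectors into one shared space and
    # concatenate their pooled summaries as the joint feature.
    def __init__(self, speech_dim=256, text_dim=256, joint_dim=256):
        super().__init__()
        self.speech_proj = nn.Linear(speech_dim, joint_dim)
        self.text_proj = nn.Linear(text_dim, joint_dim)

    def forward(self, speech_feat, text_feat):  # (batch, seq, dim) each
        s = self.speech_proj(speech_feat).mean(dim=1)  # pooled speech summary
        t = self.text_proj(text_feat).mean(dim=1)      # pooled review summary
        return torch.cat([s, t], dim=-1)               # (batch, 2 * joint_dim)

joint = JointRepresentation()(torch.randn(2, 50, 256), torch.randn(2, 20, 256))
print(joint.shape)  # (2, 512)
```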
需要说明的是,在目前的神经网络模型中,多模态的联合特征表示和多模态融合之间已经没有明确的界限,因此,在本发明的一些实施方式中,步骤S4也可为:将所述语音文本注意力特征向量和所述食材评论文本注意力特征向量融合表示得到融合特征向量。多模态联合特征表示以及多模态融合均是为了将所述实时语音数据和所述历史食材评论文本组合,更好地提取和表示两者的特征信息。It should be noted that in the current neural network model, there is no clear boundary between multi-modal joint feature representation and multi-modal fusion. Therefore, in some embodiments of the present invention, step S4 may also be: The speech text attention feature vector and the food review text attention feature vector are fused to obtain a fusion feature vector. Multi-modal joint feature representation and multi-modal fusion are intended to combine the real-time voice data and the historical food review text to better extract and represent the feature information of both.
在步骤S5中,其具体包括:In step S5, it specifically includes:
将所述联合特征向量经全连接层组合后，输出至Softmax函数，计算所述语音文本数据和所述历史食材评论文本数据文本语义的得分及其归一化得分结果，得到分类结果信息。After the joint feature vector is combined through the fully connected layer, it is output to the Softmax function, which calculates the text semantic scores of the speech text data and the historical food review text data and their normalized score results to obtain the classification result information.
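Illustratively, a minimal sketch of the fully connected combination followed by Softmax scoring is given below; the layer sizes and the number of item categories are illustrative assumptions.

```python
import torch
import torch.nn as nn

n_classes = 8  # illustrative number of item categories
head = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, n_classes))

joint = torch.randn(2, 512)            # joint feature vectors
scores = head(joint)                   # fully connected combination
probs = torch.softmax(scores, dim=-1)  # normalized category scores
pred = probs.argmax(dim=-1)            # item category information
print(probs.shape, pred)
```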
在本发明的其他实施方式中，也可根据模型结构选择其他激活函数，本发明对此不做具体限制。In other embodiments of the present invention, other activation functions can also be selected according to the model structure, and the present invention does not impose specific restrictions on this.
综上所述，本发明所提供的方法依次通过上述步骤，完成了对所获取的语音数据的识别与分类任务，并且通过获取历史食材评论文本数据，将历史食材评论文本数据作为预训练和预测模型的数据集的一部分，更全面地获取了文本语义特征信息，通过综合运用语音文本数据和历史食材评论文本数据，将历史食材评论文本数据作为补充数据，弥补了语音数据文本语义信息较少的问题，有效提高了文本分类准确度，从而提高了对物品进行分类的准确率。并且，通过构建融合了深度神经网络和卷积神经网络的网络模型提高了实时语音识别的精度；通过构建融合了上下文信息机制、自注意力机制和互注意力机制的神经网络模型，更充分地提取文本语义特征信息。整体模型结构具有优秀的深度学习表征能力，对语音文本分类的准确率高，大幅提升了对物品类别进行分类的准确率和泛化能力。In summary, the method provided by the present invention completes the recognition and classification of the acquired voice data through the above steps in sequence. By obtaining historical food review text data and using it as part of the data set for pre-training and for the prediction model, text semantic feature information is obtained more comprehensively; by comprehensively using the speech text data and the historical food review text data, with the latter as supplementary data, the problem of the voice data text carrying relatively little semantic information is remedied, effectively improving text classification accuracy and thus the accuracy of item classification. Moreover, building a network model that fuses a deep neural network and a convolutional neural network improves the accuracy of real-time speech recognition, and building a neural network model that fuses a context information mechanism, a self-attention mechanism and a mutual-attention mechanism extracts text semantic feature information more fully. The overall model structure has excellent deep learning representation capability and high accuracy in classifying speech text, greatly improving the accuracy and generalization ability of item category classification.
在步骤S6中,其具体包括:In step S6, it specifically includes:
将所述物品类别信息转换为语音进行输出,和/或Convert the item category information into speech for output, and/or
将所述物品类别信息转换为语音传输至客户终端输出,和/或Convert the item category information into voice and transmit it to the client terminal for output, and/or
将所述物品类别信息转换为文本进行输出,和/或Convert the item category information into text for output, and/or
将所述物品类别信息转换为文本传输至客户终端输出。Convert the item category information into text and transmit it to the client terminal for output.
如步骤S6所述，在本实施方式中，在通过前述步骤获得分类结果信息并判断得到物品类别信息后，可将其转换为语音，通过智能冰箱内置的声音播放设备播报所述物品类别信息，从而直接与用户进行语音交互，或者也可将所述物品类别信息转换为文本，直接通过智能冰箱配置的显示装置显示。并且，也可将物品类别信息语音通信传输至客户终端输出，这里，客户终端为具有信息接收功能的电子设备，如将语音传输至手机、智能音响、蓝牙耳机等设备进行播报，或将分类结果信息文本通过短信、邮件等方式通讯传输至诸如手机、平板电脑等客户终端或客户终端安装的应用软件，供用户查阅。从而实现多渠道多种类的分类结果信息输出方式，用户并不局限于只能在智能冰箱附近处获得相关信息，配合本发明所提供的多渠道多种类实时语音获取方式，使得用户能够直接在远程与智能冰箱进行交互，具有极高的便捷性，大幅提高了用户使用体验。在本发明的其他实施方式中，也可仅采用上述分类结果信息输出方式中的一种或几种，或者也可基于现有技术通过其他渠道输出分类结果信息，本发明对此不作具体限制。As described in step S6, in this embodiment, after the classification result information is obtained through the preceding steps and the item category information is determined, it can be converted into voice and broadcast through the sound playback device built into the smart refrigerator, thereby interacting with the user directly by voice; or the item category information can be converted into text and displayed directly through the display device configured on the smart refrigerator. The item category information can also be transmitted by voice communication to a client terminal for output. Here, the client terminal is an electronic device with an information receiving function: for example, the voice can be transmitted to a mobile phone, smart speaker, Bluetooth headset or other device for broadcast, or the classification result text can be transmitted by SMS, email or other communication means to a client terminal such as a mobile phone or tablet computer, or to application software installed on the client terminal, for the user to review. This realizes multi-channel, multi-type output of classification result information, so the user is not limited to obtaining relevant information only near the smart refrigerator; combined with the multi-channel, multi-type real-time voice acquisition provided by the present invention, the user can interact with the smart refrigerator directly and remotely, which is extremely convenient and greatly improves the user experience. In other embodiments of the present invention, only one or several of the above classification result information output methods may be used, or the classification result information may be output through other channels based on the existing technology; the present invention does not impose specific restrictions on this.
综上所述，本发明提供的一种基于深度学习的物品分类方法，其通过多渠道获取包含物品信息的实时语音数据，在将实时语音数据进行文本转写后，结合历史食材评论文本数据通过深度神经网络模型充分提取了文本语义特征，获得物品类别信息后通过多渠道进行输出，显著改善语音识别精度和物品类别判断准确率的同时，使得交互方式更加便捷多元，大幅提高用户体验。To sum up, the present invention provides an item classification method based on deep learning, which obtains real-time voice data containing item information through multiple channels; after the real-time voice data is transcribed into text, it is combined with historical food review text data and the text semantic features are fully extracted through the deep neural network model, and the item category information obtained is output through multiple channels. This significantly improves speech recognition accuracy and the accuracy of item category judgment, while making the interaction more convenient and diverse and greatly improving the user experience.
如图6所示,基于同一发明构思,本发明还提供一种基于深度学习的物品分类装置7,其包括:As shown in Figure 6, based on the same inventive concept, the present invention also provides an item classification device 7 based on deep learning, which includes:
数据获取模块71,用于获取实时语音数据和获取历史文本数据;Data acquisition module 71, used to acquire real-time voice data and acquire historical text data;
转写模块72,用于转写所述实时语音数据为语音文本数据;Transcription module 72, used to transcribe the real-time voice data into voice text data;
特征提取模块73,用于提取所述语音文本数据文本特征和提取所述历史文本数据的文本特征;Feature extraction module 73, used to extract text features of the voice text data and extract text features of the historical text data;
联合表示模块74,用于将所述实时语音数据文本特征和所述历史文本数据文本特征联合表示得到联合特征;The joint representation module 74 is used to jointly represent the text features of the real-time speech data and the text features of the historical text data to obtain joint features;
结果计算模块75,用于将所述联合特征经全连接层组合后,输出至分类器计算得分得到分类结果信息,并判断得到物品类别信息;The result calculation module 75 is used to combine the joint features through the fully connected layer and output it to the classifier to calculate the score to obtain the classification result information, and determine to obtain the item category information;
输出模块76,用于输出所述物品类别信息。The output module 76 is used to output the item category information.
基于同一发明构思,本发明还提供一种电器设备,其包括:Based on the same inventive concept, the present invention also provides an electrical device, which includes:
存储器,用于存储可执行指令;Memory, used to store executable instructions;
处理器,用于运行所述存储器存储的可执行指令时,实现上述的基于深度学习的物品分类方法。The processor is configured to implement the above deep learning-based item classification method when running executable instructions stored in the memory.
基于同一发明构思,本发明还提供一种冰箱,其包括:Based on the same inventive concept, the present invention also provides a refrigerator, which includes:
存储器,用于存储可执行指令;Memory, used to store executable instructions;
处理器,用于运行所述存储器存储的可执行指令时,实现上述的基于深度学习的物品分类方法。The processor is configured to implement the above deep learning-based item classification method when running executable instructions stored in the memory.
基于同一发明构思，本发明还提供一种计算机可读存储介质，其存储有可执行指令，其特征在于，所述可执行指令被处理器执行时实现上述的基于深度学习的物品分类方法。Based on the same inventive concept, the present invention also provides a computer-readable storage medium storing executable instructions, characterized in that when the executable instructions are executed by a processor, the above-mentioned deep learning-based item classification method is implemented.
应当理解，虽然本说明书按照实施方式加以描述，但并非每个实施方式仅包含一个独立的技术方案，说明书的这种叙述方式仅仅是为清楚起见，本领域技术人员应当将说明书作为一个整体，各实施方式中的技术方案也可以经适当组合，形成本领域技术人员可以理解的其他实施方式。It should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution; this manner of description is adopted only for clarity. Those skilled in the art should regard the specification as a whole, and the technical solutions in the various embodiments can also be appropriately combined to form other embodiments understandable to those skilled in the art.
上文所列出的一系列的详细说明仅仅是针对本发明的可行性实施方式的具体说明，并非用以限制本发明的保护范围，凡未脱离本发明技艺精神所作的等效实施方式或变更均应包含在本发明的保护范围之内。The series of detailed descriptions listed above are only specific descriptions of feasible embodiments of the present invention and are not intended to limit the scope of protection of the present invention; any equivalent embodiments or changes made without departing from the technical spirit of the present invention shall be included within the scope of protection of the present invention.

Claims (15)

  1. 一种基于深度学习的物品分类方法,其特征在于,包括步骤:An item classification method based on deep learning, which is characterized by including the steps:
    获取包含物品信息的实时语音数据,获取历史文本数据;Obtain real-time voice data containing item information and obtain historical text data;
    转写所述实时语音数据为语音文本数据,提取所述语音文本数据文本特征;Transcribe the real-time voice data into voice text data, and extract text features of the voice text data;
    提取所述历史文本数据的文本特征;Extract text features of the historical text data;
    将所述实时语音数据文本特征和所述历史文本数据文本特征联合表示得到联合特征;jointly represent the text features of the real-time speech data and the text features of the historical text data to obtain joint features;
    将所述联合特征经全连接层组合后,输出至分类器计算得分得到分类结果信息,并判断得到物品类别信息;After the joint features are combined through the fully connected layer, they are output to the classifier to calculate the score to obtain the classification result information, and determine the item category information;
    输出所述物品类别信息。Output the item category information.
  2. 根据权利要求1所述的基于深度学习的物品分类方法,其特征在于,所述获取历史文本数据,具体包括:The item classification method based on deep learning according to claim 1, characterized in that said obtaining historical text data specifically includes:
    获取历史食材评论文本数据作为所述历史文本数据。Obtain historical food ingredient review text data as the historical text data.
  3. 根据权利要求1所述的基于深度学习的物品分类方法,其特征在于,所述转写所述实时语音数据为语音文本数据,提取所述语音文本数据文本特征,具体包括:The item classification method based on deep learning according to claim 1, wherein the transcribing the real-time speech data is speech text data, and extracting the text features of the speech text data specifically includes:
    提取所述实时语音数据特征,得到语音特征;Extract the real-time voice data features to obtain voice features;
    将所述语音特征输入语音识别深度神经网络模型转写得到第一语音文本向量;Enter the speech feature into a speech recognition deep neural network model and transcribe it to obtain a first speech text vector;
    将所述第一语音文本向量输入语音识别卷积神经网络进行编码,得到第二语音文本向量。The first speech text vector is input into a speech recognition convolutional neural network for encoding to obtain a second speech text vector.
  4. 根据权利要求3所述的基于深度学习的物品分类方法,其特征在于,所述提取所述实时语音数据特征,具体包括:The item classification method based on deep learning according to claim 3, wherein the extracting the real-time voice data features specifically includes:
    提取所述实时语音数据特征,获取其梅尔频率倒谱系数特征。Extract the characteristics of the real-time speech data and obtain its Mel frequency cepstrum coefficient characteristics.
  5. 根据权利要求3所述的基于深度学习的物品分类方法,其特征在于,提取所述历史文本数据的文本特征,具体包括:The item classification method based on deep learning according to claim 3, characterized in that extracting text features of the historical text data specifically includes:
    将所述历史食材评论文本数据转化为食材评论词向量;Convert the historical food material review text data into food material review word vectors;
    将所述食材评论词向量输入双向长短记忆网络模型，获取包含基于所述历史食材评论文本数据上下文特征信息的食材评论上下文特征向量。Input the food review word vectors into a bidirectional long short-term memory network model to obtain a food review context feature vector containing context feature information based on the historical food review text data.
  6. 根据权利要求5所述的基于深度学习的物品分类方法,其特征在于,还包括步骤:The deep learning-based item classification method according to claim 5, further comprising the steps of:
    基于注意力机制模型,增强所述语音文本数据和所述历史食材评论文本数据的文本特征。Based on the attention mechanism model, the text features of the speech text data and the historical ingredient review text data are enhanced.
  7. 根据权利要求6所述的基于深度学习的物品分类方法,其特征在于,所述基于注意力机制模型,增强所述语音文本数据和历史食材评论文本数据的文本特征,具体包括:The item classification method based on deep learning according to claim 6, characterized in that the attention mechanism model is used to enhance the text features of the voice text data and historical ingredient review text data, specifically including:
    分别将所述第二语音文本向量和所述食材评论上下文特征向量输入自注意力机制模型和互注意力机制模型；Input the second speech text vector and the food review context feature vector into the self-attention mechanism model and the mutual-attention mechanism model respectively;
    获取包含所述语音文本数据自身权重信息以及所述语音文本数据与所述历史食材评论文本数据之间权重信息的语音文本注意力特征向量;Obtain a voice text attention feature vector that includes the weight information of the voice text data itself and the weight information between the voice text data and the historical ingredient review text data;
    获取包含所述历史食材评论文本数据自身权重信息以及所述历史食材评论文本数据上下文特征向量与所述语音文本数据之间权重信息的食材评论文本注意力特征向量。Obtain a food review text attention feature vector that includes the weight information of the historical food review text data itself and the weight information between the context feature vector of the historical food review text data and the voice text data.
  8. 根据权利要求7所述的基于深度学习的物品分类方法,其特征在于,所述将所述实时语音数据文本特征和所述历史文本数据文本特征联合表示得到联合特征向量,具体包括:The item classification method based on deep learning according to claim 7, wherein the joint representation of the text features of the real-time voice data and the text features of the historical text data to obtain a joint feature vector specifically includes:
    将所述语音文本注意力特征向量和所述食材评论文本注意力特征向量共同映射到一个统一多模态向量空间进行联合表示得到所述联合特征向量。The voice text attention feature vector and the food review text attention feature vector are jointly mapped to a unified multi-modal vector space for joint representation to obtain the joint feature vector.
  9. 根据权利要求7所述的基于深度学习的物品分类方法，其特征在于，所述将所述联合特征经全连接层组合后，输出至分类器计算得分得到分类结果信息，具体包括：The item classification method based on deep learning according to claim 7, characterized in that combining the joint features through the fully connected layer and outputting them to the classifier to calculate scores to obtain classification result information specifically includes:
    将所述联合特征向量经全连接层组合后,输出至Softmax函数,计算所述语音文本数据和所述历史食材评论文本数据文本语义的得分及其归一化得分结果,得到分类结果信息。After the joint feature vector is combined through the fully connected layer, it is output to the Softmax function, and the scores of the textual semantics of the speech text data and the historical food review text data and their normalized score results are calculated to obtain classification result information.
  10. 根据权利要求1所述的基于深度学习的物品分类方法，其特征在于，所述获取包含物品信息的实时语音数据，具体包括：The item classification method based on deep learning according to claim 1, characterized in that obtaining the real-time voice data containing item information specifically includes:
    获取语音采集装置所采集的所述实时语音数据,和/或Obtain the real-time voice data collected by the voice collection device, and/or
    获取自客户终端传输的所述实时语音数据。The real-time voice data transmitted from the client terminal is obtained.
  11. 根据权利要求2所述的基于深度学习的物品分类方法,其特征在于,所述获取历史食材评论文本数据作为所述历史文本数据,具体包括:The item classification method based on deep learning according to claim 2, characterized in that said obtaining historical ingredient review text data as the historical text data specifically includes:
    获取内部存储的历史食材评论文本作为历史食材评论文本数据,和/或Obtain the internally stored historical ingredient review text as historical ingredient review text data, and/or
    获取外部存储的历史食材评论文本作为历史食材评论文本数据,和/或Obtain the externally stored historical ingredient review text as historical ingredient review text data, and/or
    获取客户终端传输的历史食材评论文本作为历史食材评论文本数据。Obtain the historical ingredient review text transmitted by the client terminal as historical ingredient review text data.
  12. 根据权利要求1所述的基于深度学习的物品分类方法,其特征在于,还包括步骤:The deep learning-based item classification method according to claim 1, further comprising the steps of:
    对所述实时语音数据进行预处理,包括:对所述实时语音数据进行分帧处理和加窗处理,Preprocessing the real-time voice data includes: framing and windowing the real-time voice data,
    对所述历史文本数据进行预处理，包括：对所述历史文本数据进行清洗处理、标注、分词、去停用词。Preprocessing the historical text data includes: cleaning, annotating, word segmentation, and stop-word removal on the historical text data.
  13. 根据权利要求1所述的基于深度学习的物品分类方法,其特征在于,所述输出所述物品类别信息包括:The item classification method based on deep learning according to claim 1, wherein the outputting the item category information includes:
    将所述物品类别信息转换为语音进行输出,和/或Convert the item category information into speech for output, and/or
    将所述物品类别信息转换为语音传输至客户终端输出,和/或Convert the item category information into voice and transmit it to the client terminal for output, and/or
    将所述物品类别信息转换为文本进行输出,和/或Convert the item category information into text for output, and/or
    将所述物品类别信息转换为文本传输至客户终端输出。Convert the item category information into text and transmit it to the client terminal for output.
  14. 根据权利要求1所述的基于深度学习的物品分类方法,其特征在于,所述转写所述实时语音数据为语音文本数据,提取所述语音文本数据文本特征;提取所述历史文本数据的文本特征,还包括:The object classification method based on deep learning according to claim 1, characterized in that the transcribing the real-time speech data is speech text data, extracting text features of the speech text data; extracting the text of the historical text data Features, also include:
    获取存储于外部缓存的配置数据，将所述实时语音数据和所述历史食材评论文本数据基于所述配置数据执行深度神经网络计算，进行文本转写和提取文本特征。Obtain the configuration data stored in the external cache, perform deep neural network computations on the real-time voice data and the historical food review text data based on the configuration data, and carry out text transcription and text feature extraction.
  15. 一种基于深度学习的物品分类装置,其特征在于,包括: An item classification device based on deep learning, which is characterized by including:
    数据获取模块,用于获取实时语音数据和获取历史文本数据;Data acquisition module, used to acquire real-time voice data and historical text data;
    转写模块,用于转写所述实时语音数据为语音文本数据;A transliteration module, used to transcribe the real-time voice data into voice text data;
    特征提取模块,用于提取所述语音文本数据文本特征和提取所述历史文本数据的文本特征;A feature extraction module, used to extract text features of the speech text data and extract text features of the historical text data;
    联合表示模块,用于将所述实时语音数据文本特征和所述历史文本数据文本特征联合表示得到联合特征;A joint representation module, used to jointly represent the text features of the real-time speech data and the text features of the historical text data to obtain joint features;
    结果计算模块,用于将所述联合特征经全连接层组合后,输出至分类器计算得分得到分类结果信息,并判断得到物品类别信息;The result calculation module is used to combine the joint features through the fully connected layer and output it to the classifier to calculate the score to obtain the classification result information, and to determine the item category information;
    输出模块,用于输出所述物品类别信息。 An output module is used to output the item category information.
PCT/CN2023/095081 2022-05-20 2023-05-18 Item classification method and apparatus based on deep learning WO2023222089A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210554861.9A CN114944156A (en) 2022-05-20 2022-05-20 Article classification method, device and equipment based on deep learning and storage medium
CN202210554861.9 2022-05-20

Publications (1)

Publication Number Publication Date
WO2023222089A1 true WO2023222089A1 (en) 2023-11-23

Family

ID=82908762

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/095081 WO2023222089A1 (en) 2022-05-20 2023-05-18 Item classification method and apparatus based on deep learning

Country Status (2)

Country Link
CN (1) CN114944156A (en)
WO (1) WO2023222089A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114944156A (en) * 2022-05-20 2022-08-26 青岛海尔电冰箱有限公司 Article classification method, device and equipment based on deep learning and storage medium
CN115098765A (en) * 2022-05-20 2022-09-23 青岛海尔电冰箱有限公司 Information pushing method, device and equipment based on deep learning and storage medium
CN116186258A (en) * 2022-12-31 2023-05-30 青岛海尔电冰箱有限公司 Text classification method, equipment and storage medium based on multi-mode knowledge graph
CN116108176A (en) * 2022-12-31 2023-05-12 青岛海尔电冰箱有限公司 Text classification method, equipment and storage medium based on multi-modal deep learning
CN116431805A (en) * 2023-03-15 2023-07-14 青岛海尔电冰箱有限公司 Text classification method and refrigeration equipment system
CN117475199A (en) * 2023-10-16 2024-01-30 深圳市泰洲科技有限公司 Intelligent classification method for customs declaration commodity

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170293687A1 (en) * 2016-04-12 2017-10-12 Abbyy Infopoisk Llc Evaluating text classifier parameters based on semantic features
US20190034823A1 (en) * 2017-07-27 2019-01-31 Getgo, Inc. Real time learning of text classification models for fast and efficient labeling of training data and customization
CN107993134A (en) * 2018-01-23 2018-05-04 北京知行信科技有限公司 A kind of smart shopper exchange method and system based on user interest
CN113111954A (en) * 2021-04-20 2021-07-13 网易(杭州)网络有限公司 User category judgment method and device, storage medium and server
CN113887410A (en) * 2021-09-30 2022-01-04 杭州电子科技大学 Deep learning-based multi-category food material identification system and method
CN114121018A (en) * 2021-12-06 2022-03-01 中国科学技术大学 Voice document classification method, system, device and storage medium
CN114944156A (en) * 2022-05-20 2022-08-26 青岛海尔电冰箱有限公司 Article classification method, device and equipment based on deep learning and storage medium
CN115062143A (en) * 2022-05-20 2022-09-16 青岛海尔电冰箱有限公司 Voice recognition and classification method, device, equipment, refrigerator and storage medium

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118209866A (en) * 2024-03-20 2024-06-18 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Lithium battery state of charge estimation system and method with rapid migration capability

Also Published As

Publication number Publication date
CN114944156A (en) 2022-08-26

Similar Documents

Publication Publication Date Title
WO2023222089A1 (en) Item classification method and apparatus based on deep learning
WO2023222088A1 (en) Voice recognition and classification method and apparatus
WO2023222090A1 (en) Information pushing method and apparatus based on deep learning
CN113408385B (en) Audio and video multi-mode emotion classification method and system
CN111933129B (en) Audio processing method, language model training method and device and computer equipment
CN111968679B (en) Emotion recognition method and device, electronic equipment and storage medium
WO2020253509A1 (en) Situation- and emotion-oriented chinese speech synthesis method, device, and storage medium
WO2024140434A1 (en) Text classification method based on multi-modal knowledge graph, and device and storage medium
WO2024140430A1 (en) Text classification method based on multimodal deep learning, device, and storage medium
CN112233680A (en) Speaker role identification method and device, electronic equipment and storage medium
CN114566189B (en) Speech emotion recognition method and system based on three-dimensional depth feature fusion
WO2024193596A1 (en) Natural language understanding method and refrigerator
CN115798459B (en) Audio processing method and device, storage medium and electronic equipment
WO2024140432A1 (en) Ingredient recommendation method based on knowledge graph, and device and storage medium
CN117077787A (en) Text generation method and device, refrigerator and storage medium
CN113129867A (en) Training method of voice recognition model, voice recognition method, device and equipment
CN112581937A (en) Method and device for acquiring voice instruction
WO2024114303A1 (en) Phoneme recognition method and apparatus, electronic device and storage medium
Ozseven Evaluation of the effect of frame size on speech emotion recognition
US11991511B2 (en) Contextual awareness in dynamic device groups
US11277304B1 (en) Wireless data protocol
Sartiukova et al. Remote Voice Control of Computer Based on Convolutional Neural Network
KR20210085182A (en) System, server and method for determining user utterance intention
CN118486305B (en) Event triggering processing method based on voice recognition
CN118428343B (en) Full-media interactive intelligent customer service interaction method and system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23807039

Country of ref document: EP

Kind code of ref document: A1