AU2019101138A4 - Voice interaction system for race games - Google Patents
Voice interaction system for race games
- Publication number
- AU2019101138A4
- Authority
- AU
- Australia
- Prior art keywords
- model
- data
- layer
- interaction system
- information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- A—HUMAN NECESSITIES
- A63—SPORTS; GAMES; AMUSEMENTS
- A63F—CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
- A63F13/00—Video games, i.e. games using an electronically generated display having two or more dimensions
- A63F13/40—Processing input control signals of video game devices, e.g. signals generated by the player or derived from the environment
- A63F13/42—Processing input control signals of video game devices, e.g. signals generated by the player or derived from the environment by mapping the input signals into game commands, e.g. mapping the displacement of a stylus on a touch screen to the steering angle of a virtual vehicle
- A63F13/424—Processing input control signals of video game devices, e.g. signals generated by the player or derived from the environment by mapping the input signals into game commands, e.g. mapping the displacement of a stylus on a touch screen to the steering angle of a virtual vehicle involving acoustic input signals, e.g. by using the results of pitch or rhythm extraction or voice recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/56—Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
-
- A—HUMAN NECESSITIES
- A63—SPORTS; GAMES; AMUSEMENTS
- A63F—CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
- A63F2300/00—Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
- A63F2300/10—Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game characterized by input arrangements for converting player-generated signals into game device control signals
- A63F2300/1081—Input via voice recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
Abstract
This invention lies in the field of cross-media retrieval. It is a system that includes an image recognition system, a text recognition system, and a voice transformation system based on deep learning. The invention consists of the following steps. First, we gathered a sufficient number of pictures and texts from the Internet. Second, the selected and preprocessed data set is divided into a training set, a validation set, and a test set. We feed the training set into the neural network in batches; by constantly adjusting network parameters such as the base learning rate, the model is continually improved. We then use the validation set and the test set to examine the model we build. Finally, we obtain a model that reaches our expectations and realizes the interaction of image and text. In brief, this invention can automatically recognize various kinds of road conditions and convey the information to the player simultaneously.
[Figure 1: system overview — road information collection, intelligent information input, voice transformation, question exchange, answer voice output, algorithm module, conversation history. Figure 2: encoder structure — image through R-CNN (36x2048), attention weighted sum (2048), LSTM, concatenation (3072), Fully Connected Layer with Tanh, history LSTM (512), output to decoder.]
Description
Voice interaction system for race games
FIELD OF THE INVENTION
This invention is in the field of digital signal processing and multimedia interaction powered by deep learning.
BACKGROUND
With the advancement of computer technology, there are now various kinds of computer games, especially race games (RAC). However, race games still have several issues. The first problem is that the interactivity is not fully fledged: information can only be input to the computer through controllers, without any feedback from the computer. The second problem is that players cannot pay attention to information about the road conditions while they are driving, mainly because the monitor is usually too small to show the 3-D (three-dimensional) scenes of the game, so it is hard for players to notice the road conditions. In order to address these problems, we designed a voice interaction system to improve the game experience.
In a traditional race game, you can only recognize the changing road conditions with your own eyes. However, players often pay too much attention to driving their cars, so they can hardly notice the dangers on the road. Because players can only use their eyes to find the dangers on the road, the experience is not very good.
With the development of high-performance computing platforms and big-data processing techniques, deep learning has become a powerful method for image recognition. It is based on learning data representations, as opposed to task-specific algorithms. Supported by a large quantity of data, it can automatically form a mechanism for self-learning and extract features from different kinds of road conditions.
We use deep learning as the basis of our invention. The invention works as follows: first, the computer receives information, including images, questions, and histories. We use Faster R-CNN to extract features from the images, and use LSTM to extract text information from the questions and histories. We then use the ATTENTION mechanism to fuse the information. The fused information is sent to an encoder. After that, the decoder transforms the information from the encoder into text. Finally, we use a voice transformer to transform the text into voice. In this way, we realize the information exchange between the player and the computer.
In the future, this technique has a bright prospect. On one hand, this invention can be applied to VR games, which will be one of the most popular research directions. On the other hand, this invention can also be applied to autonomous driving, which will definitely change people's daily life.
SUMMARY
In order to solve the problem that players cannot take proper action during a race game due to the restriction of the monitor, our design proposes a voice interaction system for race games based on deep learning.
In order to solve the problem that the player's field of vision is not broad enough to perform correct operations in a racing game, our design proposes a voice interaction system based on deep learning. The system comprehensively interacts with information from different media, obtaining word vectors through the Embedding Layer and extracting text feature information through the LSTM Layer. The algorithm also uses the ATTENTION mechanism to better fuse information from images and texts. The design improves the extraction of text features and overcomes technical problems of model training such as over-fitting. This method solves the problem of integrated interaction between different media.
In order to build the corpus used to train the model, we use Python to segment the words from a large amount of text and perform word-frequency statistics to obtain a large amount of data. We divide the data into a training set, a validation set, and a test set. We train the model parameters on the training set, and evaluate the model performance on the validation set and test set.
We use the Encoder-Decoder architecture to encode and decode information. In the Encoder section, we convert the text into word vectors through the Semantic Layer network, and then characterize the text information using the LSTM model commonly used in the NLP field. For the extraction of image features, we use the well-performing Faster-RCNN method. Through the ATTENTION mechanism, we fuse text and image features to obtain comprehensive information on both. In the Decoder part, we send the Answer information to the Semantic Layer to obtain word vectors, then use the LSTM Layer to extract the feature information from the Answer and fuse it with the comprehensive information obtained by the Encoder part. Finally, the corresponding score is obtained by the Softmax activation function. In order to avoid the over-fitting phenomenon in the deep learning model, we introduce dropout. And in order to better learn the model parameters, we introduce the Cross Entropy Loss Function.
Our convolutional neural network is a sequence of Layers. Figure 1 displays the architecture of our network. Our network has two Semantic Layers, three LSTM Layers, a Faster-RCNN Layer, an ATTENTION Layer, a Fully Connected Layer, and a Softmax Layer.
The Semantic Layer changes the information from text form into vectors. Because computers can only recognize numerical data, we need to transform the information into word vectors so that the computer can calculate with it.
The LSTM Layer extracts features from the word vectors and changes their sizes. It can also effectively increase the accuracy of the prediction.
The Faster-RCNN Layer extracts features from the images and creates image feature vectors.
The ATTENTION Layer is used to reduce the dimension of the vectors from the Faster-RCNN Layer from three dimensions to one dimension, thus making them compatible for calculation with the word vectors.
The Fully Connected Layer computes the class scores; each neuron in this Layer is connected to all the numbers in the previous volume, so their activations can be computed with a matrix multiplication.
The Softmax Layer is used in the process of multi-classification, mapping the outputs of multiple neurons to the interval (0, 1), which can be understood as probabilities for multi-classification. The computer can then report the answer that has the highest probability.
The design can complete the extraction of the image information and realize the interaction with the player, thereby improving the game experience of the player.
DESCRIPTION OF DRAWINGS
Figure 1 is the structure of the whole system
Figure 2 is the structure of Encoder in Algorithm module
Figure 3 is the structure of Decoder in Algorithm module
Figure 4(a) and figure 4(b) are the internal structure of Semantic Layer
Figure 5 shows the visual interface
Figure 6 is the procedure of the Training
DESCRIPTION OF PREFERRED EMBODIMENT
Network Design
Figure 1 shows the structure of our neural network. The network has a Faster-RCNN Layer, two semantic Layers, one ATTENTION Layer, three LSTM Layers, one Fully Connected Layer, and a Softmax Layer.
(1) Faster-RCNN Layer
The input data of the Faster-RCNN Layer is the images that we gathered from the Internet. We send the pictures into the Faster-RCNN Layer, and the size of the images becomes (32x36x2048).
(2) ATTENTION Layer
After we get the information from the first Layer, we put the information into the ATTENTION Layer. This Layer changes the size of the information from (32x36x2048) to a one-dimensional vector with the size of 2048.
(3) Semantic Layer 1
The information from the questions is sent to the first Semantic Layer. Inside the Semantic Layer there are two algorithms, ELMo and GloVe. Each algorithm produces a vector of size 300, and after passing through these algorithms the two vectors are concatenated to create a whole word vector of size 600. Because the computer can only understand numbers, we have to change the text information into vectors so that the computer can understand it. In the meantime, we also input the conversation-history information into the same Semantic Layer to create word vectors. Because we use the same corpus, the words from the two kinds of text are also the same; that is the reason why we use only one Semantic Layer here.
(4) LSTM Layer 1
The word vectors from the question input then go into LSTM Layer 1. This Layer transforms the size of the word vectors from 600 to (320x512). The LSTM Layer has an evident advantage: the data sent into the LSTM Layer is a complete sentence, so the words in the data have continuity, meaning that the LSTM Layer can capture the relationships between words, which raises the accuracy of the prediction.
(5) LSTM Layer 2
We do the same operation to the word vectors which come from the conversation histories. Also, the LSTM Layer 2 transforms the vectors into the size of (320x512).
(6) Fully Connected Layer
Before we put data into the Fully Connected Layer, we combine the vectors obtained in the former stages by concatenating the three vectors along the same dimension. The dimension after this operation is 512+512+2048=3072.
After concatenating the vectors, the data from the three different origins becomes one vector that contains all of the previous information. However, a vector of size 3072 is too large for the computer to calculate with efficiently; to reduce the calculation time, we decide to compress the data. We put this 3072-dimensional vector into the Fully Connected Layer and get a vector of size 512. In order to prevent over-fitting, we employ a dropout operation after the Fully Connected Layer. We then process the vector with a Tanh function and send the data to the decoder, which translates it into information that we humans can understand.
(7) Semantic Layer 2
After we set up our model, we need to train it. We first gather a large dataset containing one hundred candidate answers for each of the questions above. As we know, the computer cannot understand text information, so we must change the data into vectors. We achieve this by using a second Semantic Layer. With the help of this Layer, we change the input answers from (32x10x100x20) to (32000x20x300).
(8) LSTM Layer 3
In order to make the data gathered from the answers comparable, we have to make its size the same as the size of the data from the encoder. The data is changed from (32000x20x300) to (32000x512). Because the data from the encoder has only two dimensions, and the computer can only calculate data with the same number of dimensions, we have to reduce the dimensionality of the data from 3 to 2. Then the computer is able to calculate with the data.
(9) Softmax Layer
After we receive data from the third LSTM Layer, we need to combine it with the data output by the encoder. We take the element-wise product of the two vectors and sum the result along the feature dimension, which turns two vectors of size (32000x512) into one vector of size (32000). Then we reshape the vector back into a three-dimensional tensor of size (32x10x100). After finishing that, we send the data to the Softmax Layer and get the weights of all 100 answers. With these weights, we can rank the answers and report the response with the highest probability to the user.
Procedure
Step 1: Encoder
Since the computer cannot understand human language, we need to translate our language into computer language, which is the process called "Encoder". During this process, we need to encode the image, the questions from the player, and the conversation history.
Computers cannot recognize the objects in a picture the way humans do; instead, a computer sees the picture as a two-dimensional matrix, so the first difficulty we faced in this research project was how to let the computer understand the different objects in a picture. To solve this problem, we use the convolutional Layers from Faster R-CNN, an advanced convolutional neural network in deep learning, to help the computer acquire the features of the picture. Then, in the activation Layer, the Softmax algorithm in the RPN (Region Proposal Network) determines whether each anchor is negative or positive, and bounding-box regression revises the anchors in order to obtain precise proposals. Afterwards, the Pooling Layer collects the feature map from the convolutional Layer and the proposals from the RPN to form proposal feature maps. Finally, the network classifies the proposals through the proposal feature maps and obtains the accurate locations.
However, the size of the information from the image is too large to calculate with, so we need to squeeze it. The Fully Connected Layer solves this problem: after the Fully Connected Layer, the size of the information is reduced while the main information is still kept.
Another problem in the image encoder is that when the input sequence is very long, it is difficult for the model to learn a reasonable vector representation. To solve this problem, we introduce the Attention Mechanism. The Attention Mechanism is very similar to how humans observe the external world: human observation focuses selectively on the significant parts to acquire the major information, instead of taking in everything as a whole. For example, when people notice a person, they usually look at the person's face first and then at other parts of the body. Afterwards, people combine the features they observed into a whole impression of this person. The Attention Mechanism works in nearly the same way. It helps the model assign different weights to different inputs and extract the more crucial information, which makes the model's decisions more accurate while keeping the same amount of calculation and storage. After this process, the image is squeezed into a valuable tensor that the computer can calculate with more easily. The next step is to encode the questions from the player and the conversation history.
The questions from the player and the conversation history are composed of human language, which is difficult for the computer to understand. To solve this problem, we need to give every word a vector, which is called Word2vector, or the semantic Layer, in deep learning. However, Word2vector alone still cannot solve this problem perfectly, since the computer cannot recognize the "distance" between words. For example, if a player inputs "Asia ice", which is nonsense to us, the computer cannot understand this and cannot output anything. WMD (Word Mover's Distance) models the distance between two documents as a combination of the semantic distances between the words in the two documents. The computer trains this process for a certain amount of time and learns the similarity between words. So, if the player inputs something for which no answer matches exactly, the computer will output the most similar answer.
WMD formula: $\min_{T \ge 0} \sum_{i,j=1}^{n} T_{ij}\, c(i,j)$, where $T_{ij}$ is the amount of word $i$ that travels to word $j$ and $c(i,j)$ is the distance between the corresponding word vectors.
Through Word2vector and WMD (Word Mover's Distance), the question and the history can be understood by the computer. For each sentence, we only allow twenty words at most. We add "0" at the end of the sentence, or delete the extra words, if the sentence is not exactly 20 words long. Then the sentence goes through the LSTM (Long Short-Term Memory) process. During this process, the computer filters out the useless information, remembers the essential information, and decides what to output.
The reason why we use LSTM instead of RNN in this process is that an RNN has difficulty remembering information from long ago, because gradient explosion can happen in an RNN. Unlike a plain RNN, an LSTM has three gates (an input gate, an output gate, and a forget gate), which give it the ability to filter out the useless information, remember the essential information, and output the most useful things.
At this point, the image, question, and history are understood by the computer. We need to concatenate these three together and prepare them for the decoder. Though each piece of information is not so big on its own, after the concatenation procedure it becomes a large amount of data that the computer cannot deal with efficiently. To solve this problem, we use the Fully Connected Layer again to squeeze such a huge vector. That brings us to the decoder eventually.
Step 2: Decoder
Figure 3 shows the structure of the decoder procedure. In the decoder module, we preprocess the data coming from the encoder and the answer dataset and change them into the same size. Then we combine these two vectors in the Answer Score module. Finally, we use the Softmax activation function to get the scores.
Firstly, we change the two datasets into the same size in order to combine them. We collect the data coming from the encoder, whose size is (32x10x512). The batch size is 32. We change the size from (32x10x512) to (32x10x100x512): this means that we test 32 images at one time, each image is asked 10 rounds of questions, each round has 100 candidate answers, and 512 is the length of the feature vector. Then we change the input from (32x10x100x512) to (32000x512) by:

batch size × rounds × number of answers = 32 × 10 × 100 = 32000
After that, we preprocess the answers into vectors and change their size. In the beginning, we embed the answers with the semantic Layer. The internal structure of the semantic Layer can be seen in Figure 4. We set the embedding size to 300. We input the answers into the ELMo and GloVe models respectively; the output vector from ELMo goes through a Fully Connected Layer to change its size from 1024 to 300. We concatenate these two 300-dimensional vectors into a 600-dimensional word vector. The size of this tensor is (32x10x100x20x600). Next, we establish LSTM Layer 3 and put the vectors from the semantic Layer into this Layer. The size changes from (32000x20x600) to (32000x512).
Secondly, we combine these two pieces of data in the Answer Score module. There are two steps. Step 1: take the element-wise product of the encoder output (32000x512) and the answer vector (32000x512); the resulting product vector combines the characteristics of both vectors, and its size does not change. Step 2: add up the 512 values in each row of the product vector; its size becomes (32000x1).
Thirdly, we use the Softmax activation function to get the weights of all 100 answers. The Softmax function is shown below:

$\mathrm{Softmax}(z_i) = \dfrac{e^{z_i}}{\sum_{j} e^{z_j}}$
Since we use supervised learning, we know the right answer; we set the right answer to 1 and the wrong answers to 0.
Fourthly, we use Cross Entropy Loss Function to calculate the loss. Cross Entropy Loss Function describes the distance between two probability distributions, and the smaller the cross entropy is, the closer they are to each other. The formula is shown as below:
$C = -\dfrac{1}{n}\sum_{x}\left[\, y \ln a + (1 - y)\ln(1 - a) \,\right]$
We set 'a' as the weight for each answer and 'y' to be 0 or 1 (0 for a wrong answer, 1 for the right answer). We can obtain the loss value by using this formula.
Finally, we pass the loss back to update the parameters and retrain the model.
The following shows the pseudocode of the training
Algorithm Model Training Process
1: Input: Import Image data I, Question data Q, Answer data A, History data H
2: Set the maximum number of iterations epoch
3: Use Cross Entropy Loss as the loss function loss_func
4: Use Stochastic Gradient Descent (SGD) as the optimizer op
5: Feed I to Faster R-CNN to get image feature I1
6: Randomly initialize model weights
7: for 1, 2, 3, ..., epoch do
8: Encoder:
9: Set Q1 ← Send Q to the Embedding layer
10: Set Q2 ← Send Q1 to the LSTM layer
11: Set H1 ← Send H to the Embedding layer
12: Set H2 ← Send H1 to the LSTM layer
13: Set I2 ← Do attention between Q2 and I1
14: Set J ← Concatenate Q2, H2 and I2
15: Decoder:
16: Set A1 ← Send A to the Embedding layer
17: Set A2 ← Send A1 to the LSTM layer
18: Set score ← Calculate the score of this answer based on A2 and J
19: Set loss ← Calculate the loss value between the answer and the standard
20: Backward based on loss
21: end for
Testing:
Learning from the work of our predecessors, we use six metrics, R@1, R@5, R@10, Mean, Mrr, and Ndcg, as the indexes to evaluate this model.
R: the recall rate, which indicates how many positive examples in the sample are predicted correctly.
R@1: the recall rate when only the top-ranked answer is considered
R@5: the recall rate within the first five ranked answers
R@10: the recall rate within the first ten ranked answers
Mrr (Mean Reciprocal Rank): an internationally common mechanism to evaluate a search algorithm. If the first result matches, the score is 1; if the second result matches, the score is 0.5; if the n-th result matches, the score is 1/n; and a query with no match scores 0. The final score is the mean of all the scores.
NDCG (Normalized Discounted Cumulative Gain): the calculation formula of NDCG at position n is

$N(n) = Z_n \sum_{j=1}^{n} \dfrac{2^{r(j)} - 1}{\log(1 + j)}$

where $r(j)$ is the relevance score of the result at position $j$ and $Z_n$ is a normalization constant.
According to the results in Table 1, we find that the recall rates, Mrr, and Ndcg are increasing and the value of Mean is decreasing, which indicates that as training proceeds, the accuracy of the algorithm is improving. As a result, the whole model is valid. The visual interface is shown in Figure 5.
Table 1 The results of the module
epoch | R@1 | R@5 | R@10 | Mean | Mrr | Ndcg |
0 | 0.3653 | 0.6546 | 0.7652 | 7.9093 | 0.5049 | 0.4262 |
1 | 0.4121 | 0.724 | 0.8326 | 5.9742 | 0.5567 | 0.4682 |
2 | 0.4441 | 0.7591 | 0.8602 | 5.1486 | 0.5879 | 0.5022 |
3 | 0.456 | 0.775 | 0.8723 | 4.9096 | 0.6005 | 0.5037 |
4 | 0.4698 | 0.7841 | 0.8794 | 4.6355 | 0.6119 | 0.5183 |
5 | 0.4758 | 0.7937 | 0.887 | 4.4394 | 0.6189 | 0.5242 |
6 | 0.4804 | 0.7973 | 0.8895 | 4.3659 | 0.6233 | 0.5272 |
7 | 0.482 | 0.8006 | 0.8926 | 4.2942 | 0.6254 | 0.5332 |
8 | 0.4875 | 0.8025 | 0.8953 | 4.241 | 0.6287 | 0.5355 |
9 | 0.4888 | 0.8054 | 0.8954 | 4.205 | 0.6304 | 0.5408 |
10 | 0.4895 | 0.8057 | 0.896 | 4.1757 | 0.6316 | 0.5468 |
Claims (5)
1. A voice interaction system for race games, wherein the algorithm model is based on the idea of a conventional deep-learning model; to be specific, a quantity of data is collected and categorized into a training set, on which simulative training is done in order to learn the model parameters, and a validation set and a testing set, which are used to assess the model performance.
2. The voice interaction system for race games of claim 1, wherein said design introduces dropout in order to avoid the over-fitting phenomenon; simultaneously, a cross-entropy function is introduced to improve the learning of the model parameters.
3. The voice interaction system for race games of claim 1, wherein said design implements the language interaction system in the realm of RAC (race games) and fulfills the functions of the core algorithm design.
4. The voice interaction system for race games of claim 1, wherein said design integrates information from different media (real-time conditions, text messages converted from voice messages, and so on) and then performs interactive processing; an ATTENTION mechanism is adopted to better integrate the main information mentioned above.
5. The voice interaction system for race games of claim 1, wherein said core algorithm model adopts Encoder-Decoder structure to encode and decode information.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2019101138A AU2019101138A4 (en) | 2019-09-30 | 2019-09-30 | Voice interaction system for race games |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2019101138A AU2019101138A4 (en) | 2019-09-30 | 2019-09-30 | Voice interaction system for race games |
Publications (1)
Publication Number | Publication Date |
---|---|
AU2019101138A4 true AU2019101138A4 (en) | 2019-10-31 |
Family
ID=68342014
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
AU2019101138A Ceased AU2019101138A4 (en) | 2019-09-30 | 2019-09-30 | Voice interaction system for race games |
Country Status (1)
Country | Link |
---|---|
AU (1) | AU2019101138A4 (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11397890B2 (en) * | 2017-04-10 | 2022-07-26 | Peking University Shenzhen Graduate School | Cross-media retrieval method based on deep semantic space |
CN111045084A (en) * | 2020-01-06 | 2020-04-21 | 中国石油化工股份有限公司 | Multi-wave self-adaptive subtraction method based on prediction feature extraction |
CN113011202A (en) * | 2021-03-23 | 2021-06-22 | 中国科学院自动化研究所 | End-to-end image text translation method, system and device based on multi-task training |
CN113011202B (en) * | 2021-03-23 | 2023-07-25 | 中国科学院自动化研究所 | End-to-end image text translation method, system and device based on multitasking training |
GB2609992A (en) * | 2021-08-10 | 2023-02-22 | Motional Ad Llc | Semantic annotation of sensor data using unreliable map annotation inputs |
GB2609992B (en) * | 2021-08-10 | 2024-07-17 | Motional Ad Llc | Semantic annotation of sensor data using unreliable map annotation inputs |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
FGI | Letters patent sealed or granted (innovation patent) | ||
MK22 | Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry |