AU2019101138A4 - Voice interaction system for race games - Google Patents
Voice interaction system for race games
- Publication number
- AU2019101138A4
- Authority
- AU
- Australia
- Prior art keywords
- model
- data
- layer
- interaction system
- information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- A—HUMAN NECESSITIES
- A63—SPORTS; GAMES; AMUSEMENTS
- A63F—CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
- A63F13/00—Video games, i.e. games using an electronically generated display having two or more dimensions
- A63F13/40—Processing input control signals of video game devices, e.g. signals generated by the player or derived from the environment
- A63F13/42—Processing input control signals of video game devices, e.g. signals generated by the player or derived from the environment by mapping the input signals into game commands, e.g. mapping the displacement of a stylus on a touch screen to the steering angle of a virtual vehicle
- A63F13/424—Processing input control signals of video game devices, e.g. signals generated by the player or derived from the environment by mapping the input signals into game commands, e.g. mapping the displacement of a stylus on a touch screen to the steering angle of a virtual vehicle involving acoustic input signals, e.g. by using the results of pitch or rhythm extraction or voice recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/56—Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
-
- A—HUMAN NECESSITIES
- A63—SPORTS; GAMES; AMUSEMENTS
- A63F—CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
- A63F2300/00—Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
- A63F2300/10—Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game characterized by input arrangements for converting player-generated signals into game device control signals
- A63F2300/1081—Input via voice recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
Abstract
This invention lies in the field of cross-media retrieval. It is a system that includes an image recognition system, a text recognition system, and a voice transformation system based on deep learning. The invention consists of the following steps. First, we gathered a sufficient number of pictures and texts from the Internet. Second, the selected and preprocessed data set is divided into a training set, a validation set, and a test set. We feed the training set into the neural network in batches; by constantly adjusting network parameters such as the base learning rate, the model is continually improved. We then use the validation set and the test set to examine the model we build. Finally, we obtain a model that reaches our expectations and realizes the interaction of image and text. In brief, this invention can automatically recognize various kinds of road conditions and convey the information to the player simultaneously.
[Figure 1: system overview — road information collection, intelligent information input, voice transformation, question exchange, answer voice output, algorithm module, conversation history. Figure 2: encoder structure — image through R-CNN (36x2048), attention weighted sum (2048), LSTM, concatenation (3072), Fully Connected Layer with Tanh, history LSTM (512), output to decoder.]
Description
Voice interaction system for race games
FIELD OF THE INVENTION
This invention is in the field of digital signal processing and multimedia interaction powered by deep learning.
BACKGROUND
With the advancement of computer technology, there are now various kinds of computer games, especially race games (RAC). However, race games still have several issues. The first problem is that the interactivity is not fully fledged: information can only be input to the computer through controllers, without any feedback from the computer. The second problem is that players cannot pay attention to information about the road conditions while they are driving, mainly because the monitor is usually too small to show the 3-D (three-dimensional) scenes of the game, so it is hard for players to notice the road conditions. In order to address these problems, we designed a voice interaction system to improve the game experience.
In a traditional race game, you can only recognize the changing road conditions with your own eyes. However, players often pay too much attention to driving their cars, so they can hardly notice the dangers on the road. Because players can only use their eyes to find the dangers on the road, the experience is not very good.
With the development of high-performance computing platforms and big-data processing techniques, deep learning has become a powerful method for image recognition. It is based on learning data representations, as opposed to task-specific algorithms. Supported by a large quantity of data, it can automatically form a mechanism for self-learning and extract features from different kinds of road conditions.
We use deep learning as the basis of our invention. The invention works as follows: first, the computer receives information, including images, questions, and histories. We use Faster R-CNN to extract features from the images, and use LSTM to extract text information from the questions and histories. We then use the ATTENTION mechanism to fuse the information. The fused information is sent to an encoder. After that, the decoder transforms the information from the encoder into text. Finally, we use a voice transformer to transform the text into voice. In this way, we realize the information exchange between the player and the computer.
In the future, this technique has a bright prospect. On one hand, this invention can be applied to VR games, which will be one of the most popular research directions. On the other hand, this invention can also be applied to autonomous driving, which will definitely change people's daily life.
SUMMARY
In order to solve the problem that players cannot take proper action during a race game due to the restriction of the monitor, our design proposes a voice interaction system for race games based on deep learning.
In order to solve the problem that the player's field of vision is not broad enough to perform correct operations in a racing game, our design proposes a voice interaction system based on deep learning. The system comprehensively interacts with information from different media, obtaining word vectors through the Embedding Layer and extracting text feature information through the LSTM Layer. The algorithm also uses the ATTENTION mechanism to better fuse information from images and texts. The design improves the extraction of text features and overcomes technical problems of model training such as over-fitting. This method solves the problem of integrated interaction between different media.
In order to build the corpus used to train the model, we use Python to segment the words from a large amount of text and perform word-frequency statistics to obtain a large amount of data. We divide the data into a training set, a validation set, and a test set. We train the model parameters on the training set, and evaluate the model performance on the validation set and test set.
We use the Encoder-Decoder architecture to encode and decode information. In the Encoder section, we convert the text into word vectors through the Semantic Layer network, and then characterize the text information using the LSTM model commonly used in the NLP field. For the extraction of image features, we use the well-performing Faster-RCNN method. Through the ATTENTION mechanism, we fuse text and image features to obtain comprehensive information on both. In the Decoder part, we send the Answer information to the Semantic Layer to obtain word vectors, then use the LSTM Layer to extract the feature information from the Answer and fuse it with the comprehensive information obtained by the Encoder part. Finally, the corresponding score is obtained by the Softmax activation function. In order to avoid the over-fitting phenomenon in the deep learning model, we introduce dropout. And in order to better learn the model parameters, we introduce the Cross Entropy Loss Function.
Our convolutional neural network is a sequence of Layers. Figure 1 displays the architecture of our network. Our network has two Semantic Layers, three LSTM Layers, a Faster-RCNN Layer, an ATTENTION Layer, a Fully Connected Layer, and a Softmax Layer.
The Semantic Layer changes the information from text form into vectors. Because computers can only recognize numerical data, we need to transform the information into word vectors so that the computer can calculate with it.
The LSTM Layer extracts features from the word vectors and changes their sizes. It can also effectively increase the accuracy of the prediction.
The Faster-RCNN Layer extracts features from the images and creates image feature vectors.
The ATTENTION Layer is used to reduce the dimension of the vectors from the Faster-RCNN Layer from three dimensions to one dimension, thus making them compatible for calculation with the word vectors.
The Fully Connected Layer computes the class scores; each neuron in this Layer is connected to all the numbers in the previous volume, so their activations can be computed with a matrix multiplication.
The Softmax Layer is used in the process of multi-classification, mapping the outputs of multiple neurons to the interval (0, 1), which can be understood as probabilities for multi-classification. The computer can then report the answer that has the highest probability.
The design can complete the extraction of the image information and realize the interaction with the player, thereby improving the game experience of the player.
DESCRIPTION OF DRAWINGS
Figure 1 is the structure of the whole system
Figure 2 is the structure of Encoder in Algorithm module
Figure 3 is the structure of Decoder in Algorithm module
Figure 4(a) and figure 4(b) are the internal structure of Semantic Layer
Figure 5 shows the visual interface
Figure 6 is the procedure of the Training
DESCRIPTION OF PREFERRED EMBODIMENT
Network Design
Figure 1 shows the structure of our neural network. The network has a Faster-RCNN Layer, two semantic Layers, one ATTENTION Layer, three LSTM Layers, one Fully Connected Layer, and a Softmax Layer.
(1) Faster-RCNN Layer
The input data of the Faster-RCNN Layer is the images that we gathered from the Internet. We send the pictures into the Faster-RCNN Layer, and the size of the images becomes (32x36x2048).
(2) ATTENTION Layer
After we get the information from the first Layer, we put the information into the ATTENTION Layer. This Layer changes the size of the information from (32x36x2048) to a one-dimensional vector with the size of 2048.
(3) Semantic Layer 1
The information from the questions is sent to the first Semantic Layer. Inside the Semantic Layer there are two algorithms, ELMo and GloVe. Each algorithm produces a vector of size 300, and after passing through these algorithms the two vectors are concatenated to create a whole word vector of size 600. Because the computer can only understand numbers, we have to change the text information into vectors so that the computer can understand it. In the meantime, we also input the conversation-history information into the same Semantic Layer to create word vectors. Because we use the same corpus, the words from the two kinds of text are also the same; that is the reason why we use only one Semantic Layer here.
(4) LSTM Layer 1
The word vectors from the question input then go into LSTM Layer 1. This Layer transforms the size of the word vectors from 600 to (320x512). The LSTM Layer has an evident advantage: the data sent into the LSTM Layer is a complete sentence, so the words in the data have continuity, meaning that the LSTM Layer can capture the relationships between words, which raises the accuracy of the prediction.
(5) LSTM Layer 2
We do the same operation to the word vectors which come from the conversation histories. Also, the LSTM Layer 2 transforms the vectors into the size of (320x512).
(6) Fully Connected Layer
Before we put data into the Fully Connected Layer, we combine the vectors obtained in the former stages by concatenating the three vectors along the same dimension. The dimension after this operation is 512+512+2048=3072.
After concatenating the vectors, the data from the three different origins becomes one vector that contains all of the previous information. However, a vector of size 3072 is too large for the computer to calculate with efficiently; to reduce the calculation time, we decide to compress the data. We put this 3072-dimensional vector into the Fully Connected Layer and get a vector of size 512. In order to prevent over-fitting, we employ a dropout operation after the Fully Connected Layer. We then process the vector with a Tanh function and send the data to the decoder, which translates it into information that we humans can understand.
(7) Semantic Layer 2
After we set up our model, we need to train it. We first gather a large dataset containing one hundred candidate answers for each of the questions above. As we know, the computer cannot understand text information, so we must change the data into vectors. We achieve this by using a second Semantic Layer. With the help of this Layer, we change the input answers from (32x10x100x20) to (32000x20x300).
(8) LSTM Layer 3
In order to make the data gathered from the answers comparable, we have to make its size the same as the size of the data from the encoder. The data is changed from (32000x20x300) to (32000x512). Because the data from the encoder has only two dimensions, and the computer can only calculate data with the same number of dimensions, we have to reduce the dimensionality of the data from 3 to 2. Then the computer is able to calculate with the data.
(9) Softmax Layer
After we receive data from the third LSTM Layer, we need to combine it with the data output by the encoder. We take the element-wise product of the two vectors and sum the result along the feature dimension, which turns two vectors of size (32000x512) into one vector of size (32000). Then we reshape the vector back into a three-dimensional tensor of size (32x10x100). After finishing that, we send the data to the Softmax Layer and get the weights of all 100 answers. With these weights, we can rank the answers and report the response with the highest probability to the user.
Procedure
Step 1: Encoder
Since the computer cannot understand human language, we need to translate our language into computer language, which is the process called "Encoder". During this process, we need to encode the image, the questions from the player, and the conversation history.
Computers cannot recognize the objects in a picture the way humans do; instead, a computer sees the picture as a two-dimensional matrix, so the first difficulty we faced in this research project was how to let the computer understand the different objects in a picture. To solve this problem, we use the convolutional Layers from Faster R-CNN, an advanced convolutional neural network in deep learning, to help the computer acquire the features of the picture. Then, in the activation Layer, the Softmax algorithm in the RPN (Region Proposal Network) determines whether each anchor is negative or positive, and bounding-box regression revises the anchors in order to obtain precise proposals. Afterwards, the Pooling Layer collects the feature map from the convolutional Layer and the proposals from the RPN to form proposal feature maps. Finally, the network classifies the proposals through the proposal feature maps and obtains the accurate locations.
However, the size of the information from the image is too large to calculate with, so we need to squeeze it. The Fully Connected Layer solves this problem: after the Fully Connected Layer, the size of the information is reduced while the main information is still kept.
Another problem in the image encoder is that when the input sequence is very long, it is difficult for the model to learn a reasonable vector representation. To solve this problem, we introduce the Attention Mechanism. The Attention Mechanism is very similar to how humans observe the external world: human observation focuses selectively on the significant parts to acquire the major information, instead of taking in everything as a whole. For example, when people notice a person, they usually look at the person's face first and then at other parts of the body. Afterwards, people combine the features they observed into a whole impression of this person. The Attention Mechanism works in nearly the same way. It helps the model assign different weights to different inputs and extract the more crucial information, which makes the model's decisions more accurate while keeping the same amount of calculation and storage. After this process, the image is squeezed into a valuable tensor that the computer can calculate with more easily. The next step is to encode the questions from the player and the conversation history.
The questions from the player and the conversation history are composed of human language, which is difficult for the computer to understand. To solve this problem, we need to give every word a vector, which is called Word2vector, or the semantic Layer, in deep learning. However, Word2vector alone still cannot solve this problem perfectly, since the computer cannot recognize the "distance" between words. For example, if a player inputs "Asia ice", which is nonsense to us, the computer cannot understand this and cannot output anything. WMD (Word Mover's Distance) models the distance between two documents as a combination of the semantic distances between the words in the two documents. The computer trains this process for a certain amount of time and learns the similarity between words. So, if the player inputs something for which no answer matches exactly, the computer will output the most similar answer.
WMD formula: $\min_{T \ge 0} \sum_{i,j=1}^{n} T_{ij}\, c(i,j)$, where $T_{ij}$ is the amount of word $i$ that travels to word $j$ and $c(i,j)$ is the distance between the corresponding word vectors.
Through Word2vector and WMD (Word Mover's Distance), the question and the history can be understood by the computer. For each sentence, we only allow twenty words at most. We add "0" at the end of the sentence, or delete the extra words, if the sentence is not exactly 20 words long. Then the sentence goes through the LSTM (Long Short-Term Memory) process. During this process, the computer filters out the useless information, remembers the essential information, and decides what to output.
The reason why we use LSTM instead of RNN in this process is that an RNN has difficulty remembering information from long ago, because gradient explosion can happen in an RNN. Unlike a plain RNN, an LSTM has three gates (an input gate, an output gate, and a forget gate), which give it the ability to filter out the useless information, remember the essential information, and output the most useful things.
At this point, the image, question, and history are understood by the computer. We need to concatenate these three together and prepare them for the decoder. Though each piece of information is not so big on its own, after the concatenation procedure it becomes a large amount of data that the computer cannot deal with efficiently. To solve this problem, we use the Fully Connected Layer again to squeeze such a huge vector. That brings us to the decoder eventually.
Step 2: Decoder
Figure 3 shows the structure of the decoder procedure. In the decoder module, we preprocess the data coming from the encoder and the answer dataset and change them into the same size. Then we combine these two vectors in the Answer Score module. Finally, we use the Softmax activation function to get the scores.
Firstly, we change the two datasets into the same size in order to combine them. We collect the data coming from the encoder, whose size is (32x10x512). The batch size is 32. We change the size from (32x10x512) to (32x10x100x512): this means that we test 32 images at one time, each image is asked 10 rounds of questions, each round has 100 candidate answers, and 512 is the length of the feature vector. Then we change the input from (32x10x100x512) to (32000x512) by:

batch size × rounds × number of answers = 32 × 10 × 100 = 32000
After that, we preprocess the answers into vectors and change their size. In the beginning, we embed the answers with the semantic Layer. The internal structure of the semantic Layer can be seen in Figure 4. We set the embedding size to 300. We input the answers into the ELMo and GloVe models respectively; the output vector from ELMo goes through a Fully Connected Layer to change its size from 1024 to 300. We concatenate these two 300-dimensional vectors into a 600-dimensional word vector. The size of this tensor is (32x10x100x20x600). Next, we establish LSTM Layer 3 and put the vectors from the semantic Layer into this Layer. The size changes from (32000x20x600) to (32000x512).
Secondly, we combine these two pieces of data in the Answer Score module. There are two steps. Step 1: take the element-wise product of the encoder output (32000x512) and the answer vector (32000x512); the resulting product vector combines the characteristics of both vectors, and its size does not change. Step 2: add up the 512 values in each row of the product vector; its size becomes (32000x1).
Thirdly, we use the Softmax activation function to get the weights of all 100 answers. The Softmax function is shown below:

$\mathrm{Softmax}(z_i) = \dfrac{e^{z_i}}{\sum_{j} e^{z_j}}$
Since we use supervised learning, we know the right answer; we set the right answer to 1 and the wrong answers to 0.
Fourthly, we use Cross Entropy Loss Function to calculate the loss. Cross Entropy Loss Function describes the distance between two probability distributions, and the smaller the cross entropy is, the closer they are to each other. The formula is shown as below:
$C = -\dfrac{1}{n}\sum_{x}\left[\, y \ln a + (1 - y)\ln(1 - a) \,\right]$
We set 'a' as the weight for each answer and 'y' to be 0 or 1 (0 for a wrong answer, 1 for the right answer). We can obtain the loss value by using this formula.
Finally, we pass the loss back to update the parameters and retrain the model.
The following shows the pseudocode of the training
Algorithm Model Training Process
1: Input: Import Image data I, Question data Q, Answer data A, History data H
2: Set the maximum number of iterations epoch
3: Use Cross Entropy Loss as the loss function loss_func
4: Use Stochastic Gradient Descent (SGD) as the optimizer op
5: Feed I to Faster R-CNN to get image feature I1
6: Randomly initialize model weights
7: for 1, 2, 3, ..., epoch do
8: Encoder:
9: Set Q1 ← Send Q to the Embedding layer
10: Set Q2 ← Send Q1 to the LSTM layer
11: Set H1 ← Send H to the Embedding layer
12: Set H2 ← Send H1 to the LSTM layer
13: Set I2 ← Do attention between Q2 and I1
14: Set J ← Concatenate Q2, H2 and I2
15: Decoder:
16: Set A1 ← Send A to the Embedding layer
17: Set A2 ← Send A1 to the LSTM layer
18: Set score ← Calculate the score of this answer based on A2 and J
19: Set loss ← Calculate the loss value between the answer and the standard
20: Backward based on loss
21: end for
Testing:
Learning from the work of our predecessors, we use six metrics, R@1, R@5, R@10, Mean, Mrr, and Ndcg, as the indexes to evaluate this model.
R: the recall rate, which indicates how many positive examples in the sample are predicted correctly.
R@1: the recall rate when only the top-ranked answer is considered
R@5: the recall rate within the first five ranked answers
R@10: the recall rate within the first ten ranked answers
Mrr (Mean Reciprocal Rank): an internationally common mechanism to evaluate a search algorithm. If the first result matches, the score is 1; if the second result matches, the score is 0.5; if the n-th result matches, the score is 1/n; and a query with no match scores 0. The final score is the mean of all the scores.
NDCG (Normalized Discounted Cumulative Gain): the calculation formula of NDCG at position n is

$N(n) = Z_n \sum_{j=1}^{n} \dfrac{2^{r(j)} - 1}{\log(1 + j)}$

where $r(j)$ is the relevance score of the result at position $j$ and $Z_n$ is a normalization constant.
According to the results in Table 1, we find that the recall rates, Mrr, and Ndcg are increasing and the value of Mean is decreasing, which indicates that as training proceeds, the accuracy of the algorithm is improving. As a result, the whole model is valid. The visual interface is shown in Figure 5.
Table 1 The results of the module
epoch | R@1 | R@5 | R@10 | Mean | Mrr | Ndcg |
0 | 0.3653 | 0.6546 | 0.7652 | 7.9093 | 0.5049 | 0.4262 |
1 | 0.4121 | 0.724 | 0.8326 | 5.9742 | 0.5567 | 0.4682 |
2 | 0.4441 | 0.7591 | 0.8602 | 5.1486 | 0.5879 | 0.5022 |
3 | 0.456 | 0.775 | 0.8723 | 4.9096 | 0.6005 | 0.5037 |
4 | 0.4698 | 0.7841 | 0.8794 | 4.6355 | 0.6119 | 0.5183 |
5 | 0.4758 | 0.7937 | 0.887 | 4.4394 | 0.6189 | 0.5242 |
6 | 0.4804 | 0.7973 | 0.8895 | 4.3659 | 0.6233 | 0.5272 |
7 | 0.482 | 0.8006 | 0.8926 | 4.2942 | 0.6254 | 0.5332 |
8 | 0.4875 | 0.8025 | 0.8953 | 4.241 | 0.6287 | 0.5355 |
9 | 0.4888 | 0.8054 | 0.8954 | 4.205 | 0.6304 | 0.5408 |
10 | 0.4895 | 0.8057 | 0.896 | 4.1757 | 0.6316 | 0.5468 |
Claims (5)
1. A voice interaction system for race games, wherein the algorithm model is based on the idea of a conventional deep-learning model; to be specific, a quantity of data is collected and categorized into a training set, on which simulative training is done in order to learn the model parameters, and a validation set and a testing set, which are used to assess the model performance.
2. The voice interaction system for race games of claim 1, wherein said design introduces dropout in order to avoid the over-fitting phenomenon; simultaneously, a cross-entropy function is introduced to improve the learning of the model parameters.
3. The voice interaction system for race games of claim 1, wherein said design implements the language interaction system in the realm of RAC (race games) and fulfills the functions of the core algorithm design.
4. The voice interaction system for race games of claim 1, wherein said design integrates information from different media (real-time conditions, text messages converted from voice messages, and so on) and then performs interactive processing; an ATTENTION mechanism is adopted to better integrate the main information mentioned above.
5. The voice interaction system for race games of claim 1, wherein said core algorithm model adopts Encoder-Decoder structure to encode and decode information.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2019101138A AU2019101138A4 (en) | 2019-09-30 | 2019-09-30 | Voice interaction system for race games |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2019101138A AU2019101138A4 (en) | 2019-09-30 | 2019-09-30 | Voice interaction system for race games |
Publications (1)
Publication Number | Publication Date |
---|---|
AU2019101138A4 true AU2019101138A4 (en) | 2019-10-31 |
Family
ID=68342014
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
AU2019101138A Ceased AU2019101138A4 (en) | 2019-09-30 | 2019-09-30 | Voice interaction system for race games |
Country Status (1)
Country | Link |
---|---|
AU (1) | AU2019101138A4 (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11397890B2 (en) * | 2017-04-10 | 2022-07-26 | Peking University Shenzhen Graduate School | Cross-media retrieval method based on deep semantic space |
CN111045084A (en) * | 2020-01-06 | 2020-04-21 | 中国石油化工股份有限公司 | Multi-wave self-adaptive subtraction method based on prediction feature extraction |
CN113011202A (en) * | 2021-03-23 | 2021-06-22 | 中国科学院自动化研究所 | End-to-end image text translation method, system and device based on multi-task training |
CN113011202B (en) * | 2021-03-23 | 2023-07-25 | 中国科学院自动化研究所 | End-to-end image text translation method, system and device based on multitasking training |
GB2609992A (en) * | 2021-08-10 | 2023-02-22 | Motional Ad Llc | Semantic annotation of sensor data using unreliable map annotation inputs |
GB2609992B (en) * | 2021-08-10 | 2024-07-17 | Motional Ad Llc | Semantic annotation of sensor data using unreliable map annotation inputs |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
FGI | Letters patent sealed or granted (innovation patent) | ||
MK22 | Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry |