CN111841007A - Game control method, device, equipment and storage medium - Google Patents

Game control method, device, equipment and storage medium

Info

Publication number: CN111841007A
Application number: CN202010741948.8A
Authority: CN (China)
Prior art keywords: audio, voice, characteristic information, game, information
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 陈堆盛, 陈柱欣, 丁涵宇, 张星, 林悦
Current Assignee: Netease Hangzhou Network Co Ltd
Original Assignee: Netease Hangzhou Network Co Ltd
Application filed by Netease Hangzhou Network Co Ltd
Priority to CN202010741948.8A
Publication of CN111841007A

Classifications

    • A: HUMAN NECESSITIES
    • A63: SPORTS; GAMES; AMUSEMENTS
    • A63F: CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F 13/00: Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F 13/40: Processing input control signals of video game devices, e.g. signals generated by the player or derived from the environment
    • A63F 13/42: Processing input control signals by mapping the input signals into game commands, e.g. mapping the displacement of a stylus on a touch screen to the steering angle of a virtual vehicle
    • A63F 13/424: Mapping input signals into game commands involving acoustic input signals, e.g. by using the results of pitch or rhythm extraction or voice recognition
    • A63F 2300/00: Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F 2300/60: Methods for processing data by generating or executing the game program
    • A63F 2300/6063: Methods for processing data for sound processing
    • A63F 2300/6072: Methods for processing data for sound processing of an input signal, e.g. pitch and rhythm extraction, voice recognition
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 to G10L 21/00
    • G10L 25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/30: Analysis techniques using neural networks
    • G10L 25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/51: Analysis techniques specially adapted for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application provides a game control method, a game control device, game control equipment, and a storage medium. The method comprises the following steps: acquiring a voice segment and a user identification for game control; inputting the voice segment into a feature extraction model for processing to obtain first audio feature information of the voice segment; matching the first audio feature information against a plurality of second audio feature information to obtain target audio feature information, the second audio feature information being pre-stored audio feature information corresponding to the user identification; determining a target control command according to the target audio feature information and the correspondence between audio feature information and control commands; and controlling execution of the corresponding operation in the game according to the target control command. The method lets a player control the game quickly by voice during play, is not limited by the player's regional language or dialect, simplifies game operation, and improves interaction efficiency.

Description

Game control method, device, equipment and storage medium
Technical Field
The present application relates to the field of game technologies, and in particular, to a method, an apparatus, a device, and a storage medium for controlling a game.
Background
In a game, a player often needs to interact with the virtual world in various ways, and interaction efficiency is a general consideration in game design: an efficient interaction mode gives the player a smooth experience and raises the competitive value of the game. At present, most game interaction is performed through keys and touch input; to interact, the player must locate a key by eye and then move a finger to tap it. As games grow in scale, the number of key types increases, and interaction efficiency drops noticeably.
With the development of machine learning, and deep learning in particular, human-computer interaction has changed greatly, and voice-controlled interaction has been introduced into games. In the related art, a general-purpose speech recognition system first recognizes the player's speech and converts the result into text that a computer can process; a natural language processing system then extracts the player's control intention from that text. This scheme involves two or more stages of processing, each of which introduces errors: besides errors from the first-stage system affecting the second-stage system, some errors are amplified in the second stage into more serious ones, so the player's control intention cannot be identified accurately.
Disclosure of Invention
The application provides a game control method, device, equipment, and storage medium to improve the accuracy of identifying a player's control intention.
In a first aspect, the present application provides a method for controlling a game, comprising:
acquiring a voice fragment and a user identification for game control;
inputting the voice segment into a feature extraction model for processing to obtain first audio feature information of the voice segment;
respectively matching the first audio characteristic information with a plurality of second audio characteristic information to obtain target audio characteristic information; the second audio characteristic information is pre-stored audio characteristic information corresponding to the user identification;
determining a target control command according to the target audio characteristic information and the corresponding relation between the audio characteristic information and the control command;
and controlling execution of the corresponding operation in the game according to the target control command.
In a second aspect, the present application provides a control apparatus for a game, comprising:
the acquisition module is used for acquiring a voice fragment and a user identifier for game control;
the feature extraction module is used for inputting the voice segment into a feature extraction model for processing to obtain first audio feature information of the voice segment;
the matching module is used for respectively matching the first audio characteristic information with a plurality of second audio characteristic information to obtain target audio characteristic information; the second audio characteristic information is pre-stored audio characteristic information corresponding to the user identification;
the processing module is used for determining a target control command according to the target audio characteristic information and the corresponding relation between the audio characteristic information and the control command;
and controlling execution of the corresponding operation in the game according to the target control command.
In a third aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the method of any one of the first aspect.
In a fourth aspect, an embodiment of the present application provides an electronic device, including:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method of any of the first aspects via execution of the executable instructions.
In a fifth aspect, embodiments of the present application provide a program, which when executed by a processor, is configured to perform the method according to any one of the first aspect above.
In a sixth aspect, the present application provides a computer program product, which includes program instructions for implementing the method of any one of the first aspect.
The game control method, device, equipment, and storage medium of the application acquire a voice segment for game control and extract its first audio feature information; the first audio feature information is matched against the user's plurality of second audio feature information to obtain the target audio feature information, the control command corresponding to the target audio feature information is determined, and the corresponding operation is executed. The game is thus controlled quickly by voice during play with low operational complexity; since the content of the user's speech need not be recognized, the method is not limited by the player's regional language or dialect, can accurately identify the player's control intention, and improves the game's response speed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a schematic diagram of a system architecture according to an embodiment of the present application;
FIG. 2 is a schematic flow chart illustrating an embodiment of a method for controlling a game provided by the present application;
FIG. 3 is a schematic diagram illustrating an embodiment of a method for controlling a game provided herein;
FIG. 4 is a schematic view of a setup interface of an embodiment of a method provided herein;
FIG. 5 is a schematic diagram of an audio control arrangement according to an embodiment of the method provided herein;
FIG. 6 is a schematic structural diagram of an embodiment of a control device for a game provided in the present application;
fig. 7 is a schematic structural diagram of an embodiment of an electronic device provided in the present application.
With the foregoing drawings in mind, certain embodiments of the disclosure have been shown and described in more detail below. These drawings and written description are not intended to limit the scope of the disclosed concepts in any way, but rather to illustrate the concepts of the disclosure to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The terms "comprising" and "having," and any variations thereof, in the description and claims of this application and the drawings are intended to cover non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to the steps or elements listed, but may also include other steps or elements not listed or inherent to such process, method, article, or apparatus.
First, a part of vocabulary and application scenarios related to the embodiments of the present application will be described.
ASR: an automatic speech recognition system, a suite of systems that automatically converts speech into character/text form that a computer can process.
SRE: a speaker recognition system, which searches a registered data set for the speaker who uttered a segment of speech; used for speaker retrieval.
ASV: a speaker verification system, which determines which speaker in a registered candidate set uttered a segment of speech; used to confirm the speaker's identity.
DTW: dynamic time warping, an algorithm based on dynamic programming that solves the problem of matching sequence templates of different lengths; an early and classic algorithm in speech recognition.
GMM: Gaussian mixture model, which describes a probability distribution with Gaussian probability density functions (normal distribution curves), decomposing the distribution into several components, each based on a Gaussian probability density function.
UBM: universal background model, a large-scale GMM obtained by unsupervised training on a large amount of data; it represents the statistical distribution of the data in feature space.
Autoencoder: a neural network used in semi-supervised and unsupervised learning; it performs representation learning by taking the input information itself as the learning target.
Fig. 1 is a schematic diagram of a system architecture according to an embodiment of the present application. As shown in fig. 1, the system architecture of the embodiment of the present application may include, but is not limited to: a terminal device 11 and a server 12. The terminal device 11 is, for example, user equipment such as a mobile phone, tablet computer, or personal computer.
The terminal device 11 and the server 12 may be connected via a network.
One or more terminal devices may be included in the system architecture, and one terminal device is illustrated in fig. 1 as an example. The server may be a server of a game, and the terminal device may run a game application.
The method provided by the embodiments of the present application may be implemented on the server side or on the terminal device. For example, the obtained user voice may be sent to the server for processing; alternatively, the method may be implemented within the game application and run on the terminal device; further, when the terminal device runs the game application, the method may also be implemented through data interaction with the server.
In the related art, one way is as follows: a general-purpose speech recognition system recognizes the player's speech and converts the result into text that a computer can process, and a natural language processing system then extracts the player's control intention from that text. This scheme involves two or more stages of processing, each of which introduces errors: besides errors from the first-stage system affecting the second-stage system, some errors are amplified in the second stage into more serious ones, so the player's control intention cannot be identified accurately.
Another interaction mode adopts voice commands customized for the game: the system gives feedback only when the player utters exactly the instruction originally built into the system. This mode in effect builds on the first by removing the second-stage natural language processing system to prevent the introduction of errors, thereby improving accuracy. However, because the instructions are fixed by the game developer during development, the player cannot customize them to personal needs, and the number of available instructions is small.
Both interaction modes also share a serious drawback: they must pass through the first-stage speech-to-text conversion, and existing speech recognition solutions must be built for the speaker's language (such as Chinese, English, or Japanese). These solutions therefore need versions for different languages according to the player's nationality, which increases the workload of system designers; moreover, for minority languages or dialects that currently have no speech recognition system, neither approach can cover players in those regions.
In the method of the present application, audio features are extracted from the recorded user voice and matched against the user's pre-stored audio features to obtain the target audio features; the control command corresponding to the target audio features is then determined, and the corresponding operation is executed.
The pre-stored audio features of the user correspond to control commands, and one control command may correspond to one or more audio features.
The method of the embodiments of the present application achieves game control by voice, has low operational complexity, is not restricted by the player's regional language or dialect, and improves game interaction efficiency.
The control method of the game in one embodiment of the present disclosure may be executed on a terminal device or a server. The terminal device may be a local terminal device. When the control method of the game is operated on the server, the method can be implemented and executed based on a cloud interaction system, wherein the cloud interaction system comprises the server and the client device.
In an optional embodiment, various cloud applications may run under the cloud interaction system, for example cloud games. Taking a cloud game as an example, a cloud game is a game mode based on cloud computing. In the cloud game operation mode, the body that runs the game program is separated from the body that presents the game picture: the storage and execution of the game control method are completed on the cloud game server, while the client device receives and sends data and presents the game picture. The client device may be a display device with data transmission functions close to the user side, such as a mobile terminal, television, computer, or handheld computer, but the device performing the information processing is the cloud game server. During play, the player operates the client device to send operation instructions to the cloud game server; the cloud game server runs the game according to the instructions, encodes and compresses data such as the game picture, and returns it to the client device over the network; the client device finally decodes the data and outputs the game picture.
In an alternative embodiment, the terminal device may be a local terminal device. Taking a game as an example, the local terminal device stores the game program and presents the game picture. The local terminal device interacts with the player through a graphical user interface; that is, the game program is conventionally downloaded, installed, and run on an electronic device. The local terminal device may provide the graphical user interface to the player in various ways: for example, it may be rendered on the display screen of the terminal, or provided through holographic projection. For example, the local terminal device may include a display screen for presenting a graphical user interface that includes the game picture, and a processor for running the game, generating the graphical user interface, and controlling its display on the display screen.
The technical solution of the present application will be described in detail below with specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.
Fig. 2 is a schematic flowchart of an embodiment of a game control method provided in the present application. As shown in fig. 2 and fig. 3, the method provided by this embodiment includes:
Step 101: acquiring a voice segment and a user identifier for game control.
Specifically, the user's terminal device runs the game application and acquires a voice segment for game control. For example, a recording function is started during the game to record the user's voice, and the voice segment for game control is obtained from the recording. The user may trigger the recording function before uttering the segment, for example by a wake-up word or a key operation; alternatively, the recording function stays on while the game runs and the voice segment for game control is intercepted from the recorded voice, for example when the segment is preceded by a special prompt sound indicating that the user is about to utter a voice segment for game control.
The user identifier is obtained from, for example, login information of the user, such as a user name, a user account, and the like.
Step 102: inputting the voice segment into the feature extraction model for processing to acquire first audio feature information of the voice segment.
Specifically, the voice segment is input into the feature extraction model for processing, and first audio feature information of the voice segment is obtained. The feature extraction model may be obtained by training a pre-established machine learning model.
Generally, the extracted audio feature information should be strongly discriminative between different pronunciations: for audio with the same pronunciation, the audio feature information should be as close as possible in feature space, while for audio with different pronunciations it should be as far apart as possible. The feature extraction model should also be highly robust: the player may be in various complex environments, so the model must resist environmental interference, and the features extracted from utterances of the same content should remain as close as possible in space.
Optionally, to improve the accuracy of subsequent processing such as feature matching, as shown in the registration and verification stages of fig. 3, qualification detection may be performed on the obtained voice segment to check whether preset voice qualification conditions are met, for example whether the effective voice duration is greater than a preset duration and/or the effective voice ratio is greater than a preset value. If the qualification detection fails, the voice segment is re-acquired; if it passes, step 102 is executed.
Step 103: matching the first audio feature information against a plurality of second audio feature information, respectively, to obtain target audio feature information; the second audio feature information is pre-stored audio feature information corresponding to the user identifier.
in an embodiment, step 103 is preceded by obtaining a plurality of second audio characteristic information stored in advance and corresponding to the user identifier.
Specifically, the player registers in advance the voice segments corresponding to each control command; for example, one control command may correspond to several voice segments, and each control command corresponds to one operation in the game, such as controlling a virtual character to run, jump, or attack.
The audio feature information of each voice clip corresponding to each control command is extracted through the feature extraction model, and the audio feature information of a plurality of voice clips corresponding to the player is stored, for example, in a database, as shown in a registration process in fig. 3. Further, the corresponding relation between the control command and the audio characteristic information can be stored.
The voice segments registered by the player place no requirement on the text content or the language used; the player can register using any language or dialect, or even any sound unrelated to linguistics, such as an animal cry. The method of this embodiment is therefore not limited by the player's region, language, or dialect, and the player can highly customize the control commands.
As shown in fig. 3, in the verification stage, the plurality of pre-stored second audio feature information are obtained, the first audio feature information is matched against them, and the target audio feature information is selected from among them. For example, if the first audio feature information is a vector A, the second audio feature information comprises vectors B to K, and vector A matches vector F best (i.e., their similarity is highest), then the target audio feature information is vector F.
Step 104: determining a target control command according to the target audio feature information and the correspondence between audio feature information and control commands.
Specifically, when the player registers the voice segments corresponding to different control commands, the audio feature information extracted from the voice segments corresponding to different control commands may be stored, for example, the correspondence between the audio feature information and the control commands may be stored. The correspondence may be stored in the terminal device or in the server.
The target control command corresponding to the target audio feature information is looked up in the correspondence. In an embodiment, the correspondence includes a plurality of control commands corresponding to the user identification and at least one piece of audio feature information corresponding to each control command.
Wherein each control command may correspond to one or more speech segments, and thus each control command may correspond to at least one audio feature information.
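As an illustration only (the names and structure below are assumptions, not taken from the patent), the correspondence can be held as a per-user mapping from each control command to the feature vectors of its registered voice segments:

```python
import numpy as np

# Hypothetical in-memory registry: for each user identifier, every control
# command maps to the feature vectors of the voice segments registered for it.
registry = {
    "user_42": {
        "jump":   [np.random.rand(32), np.random.rand(32)],  # two registered segments
        "attack": [np.random.rand(32)],
    }
}

def command_for_feature(user_id, target_feature):
    """Look up the control command whose stored feature equals the matched target."""
    for command, features in registry[user_id].items():
        if any(np.array_equal(f, target_feature) for f in features):
            return command
    return None
```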
Step 105: controlling execution of the corresponding operation in the game according to the target control command.
Specifically, after the target control command is determined, execution of the corresponding operation in the game is controlled according to it, for example controlling the virtual character in the game to run, jump, and so on.
In this embodiment, a voice segment for game control is acquired and its first audio feature information is extracted; the first audio feature information is matched against the user's plurality of second audio feature information to obtain the target audio feature information, the corresponding control command is determined, and the corresponding operation is executed. The game is thus controlled quickly by voice during play with low operational complexity; since the content of the user's speech need not be recognized, the method is not limited by the player's regional language or dialect, can accurately identify the player's control intention, and improves the game's response speed.
On the basis of the above embodiment, the "acquiring a voice segment for game control" in step 101 may be specifically implemented as follows:
when detecting that a preset awakening word appears in the recorded user voice, starting to intercept the voice, and when meeting a preset condition, stopping intercepting the voice to obtain the voice fragment;
wherein the preset conditions include: and intercepting the time length to reach a first preset time length, or detecting that the second preset time length has no effective voice.
Specifically, during game play the recording function can be kept on to record the user's voice continuously, which improves interaction efficiency. To avoid misoperation, a wake-up word can be preset: interception of the voice starts when the wake-up word is detected and stops when a preset condition is met, yielding the voice segment for game control.
The wake-up word is chosen to differ from the speech the player frequently utters for game control, for example, so as to improve recognition.
For example, interception stops when the intercepted duration reaches a first preset duration, or when no effective voice is detected for a second preset duration.
The first preset time period and the second preset time period may be the same or different.
In the above embodiment, the effective voice segment for game control is obtained through the wake-up word, so that misoperation is avoided, and the accuracy of control is improved.
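A minimal sketch of this interception logic at the frame level (the two detector callbacks are assumptions, not patent content):

```python
def capture_command_segment(frames, detect_wake_word, is_effective_speech,
                            max_len=40, max_silence=15):
    """Start intercepting after a wake word; stop once the clip reaches max_len
    frames (first preset duration) or max_silence consecutive frames contain no
    effective speech (second preset duration)."""
    clip, recording, silence = [], False, 0
    for frame in frames:
        if not recording:
            recording = detect_wake_word(frame)   # hypothetical detector callback
            continue
        clip.append(frame)
        silence = 0 if is_effective_speech(frame) else silence + 1
        if len(clip) >= max_len or silence >= max_silence:
            break
    return clip
```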
In an embodiment, before matching, it is necessary to set a voice segment corresponding to each operation (i.e., control command) in the game in advance, and the method of this embodiment includes:
providing a graphical user interface through terminal equipment, wherein the graphical user interface at least comprises an audio control setting control;
responding to the operation of the user on the audio control setting control, acquiring a voice fragment of the user, and performing audio control setting; wherein each operation of the user in the game can be controlled by one or more voice segments.
Specifically, for each operation the player wants to customize, the player can register one or more voice segments with the same or similar pronunciation to identify the operation, for example 1 to 3 segments. In subsequent interaction, when the player utters the same or similar speech, the system can perceive the player's corresponding operation intention and give feedback.
As shown in fig. 4, the user's terminal device runs the game application and a graphical user interface is rendered on its display. The graphical user interface contains at least an audio control setting control and displays an operation interface of the game scene. The audio control setting control is used for audio control setup, that is, obtaining the voice segments corresponding to different operations and then extracting the audio feature information of each segment.
The player operates the audio control setting control, for example by a click or slide operation, to start the recording function; the user's voice segment is obtained and the audio control setup is performed, so that each operation in the game can be controlled by one or more voice segments.
In this way, the player can customize the voice segments corresponding to operations in the game. To improve matching accuracy, the voice segments corresponding to different operations should differ, that is, their audio feature information should be clearly distinct.
In an embodiment, as shown in fig. 3, in the registration step, the step "responding to the operation of the user on the audio control setting control, acquiring the voice segment of the user, and performing audio control setting" may be implemented in the following manner:
in response to a user operation of the audio control setting control, displaying at least one operation of the game on the graphical user interface;
starting a recording function according to the operation selected by the user to acquire a voice fragment of the user;
inputting the voice segments into the feature extraction model for processing to obtain audio feature information of the voice segments;
and storing the audio characteristic information and the control command corresponding to the operation to obtain a corresponding relation.
Specifically, at least one operation of the game is displayed on the graphical user interface in response to the user's operation of the audio control setting control, for example when the user clicks it. The displayed operation may be an in-game picture of the operation and/or the operation's name; the embodiments of the present application do not limit this.
As shown in fig. 5, operations of the virtual character such as running, squatting, getting up, and jumping are displayed on the interface. When the user selects an operation, for example running, the recording function starts and the user's voice segment is obtained. The voice segment is input into the feature extraction model to obtain its audio feature information, and the audio feature information is stored together with the control command corresponding to the operation to obtain the correspondence.
In this embodiment, the control commands for operations in the game are registered in advance and the audio feature information corresponding to different control commands is stored; the player can define the voices corresponding to the various control commands, which improves the accuracy and efficiency of control during game play.
In one embodiment, in order to improve the accuracy of the subsequent feature matching and filter invalid speech segments, before inputting the speech segments into the feature extraction model for processing, the method further includes:
determining whether the voice fragment meets a preset voice qualified condition, wherein the voice qualified condition comprises that the time length is greater than a preset time length and/or the effective voice ratio is greater than a preset value;
and if the voice segment meets the voice qualification condition, inputting the voice segment into a feature extraction model for processing.
In one embodiment, to ensure the accuracy of the post-stage feature matching, the length of the audio feature must satisfy the minimum length constraint, i.e., the valid speech duration is greater than the preset valid duration.
In an embodiment, a part of invalid voice segments in the voice segments are removed before feature extraction, and in order to ensure the accuracy of the feature matching at the later stage, the ratio of valid voice needs to be greater than a preset value.
If at least one of the above conditions is met, the qualification detection passes and the voice segment is passed to the feature extraction model; if the detection fails, re-recording is required.
In this embodiment, a voice activity detection (VAD) algorithm may be used to detect effective speech. Based on the differences in feature distribution between target speech (effective speech) and non-target speech (ineffective speech), VAD can perform frame-by-frame detection on a voice segment and filter out the ineffective speech.
The voice activity detection algorithm can be trained from training data. For example, energy, zero-crossing rate, and spectrogram band distribution can be selected as features for distinguishing target from non-target speech; separate GMMs are built for target and non-target speech from the labeled voice segments, and the two GMMs are trained. When a voice segment needs to be detected, its probabilities under the two GMMs are calculated; if, for example, its output probability under the target-speech GMM is higher, the segment is judged to be target speech, that is, effective speech.
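A minimal sketch of this two-GMM frame classifier, using scikit-learn with synthetic stand-in frame features (the feature choice and dimensions here are assumptions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy stand-ins for frame features (e.g. energy, zero-crossing rate, band energies).
rng = np.random.default_rng(0)
speech_frames = rng.normal(1.0, 0.3, size=(500, 3))   # labeled target speech
noise_frames  = rng.normal(0.0, 0.3, size=(500, 3))   # labeled non-target audio

gmm_speech = GaussianMixture(n_components=4, random_state=0).fit(speech_frames)
gmm_noise  = GaussianMixture(n_components=4, random_state=0).fit(noise_frames)

def is_effective_speech(frame_features):
    """Keep a frame if the target-speech GMM scores it higher than the noise GMM."""
    x = np.atleast_2d(frame_features)
    return gmm_speech.score_samples(x)[0] > gmm_noise.score_samples(x)[0]
```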
In the embodiment, the speech qualification is detected before the feature extraction, so that the accuracy of subsequent feature matching is improved, and invalid speech segments can be filtered.
In one embodiment, whether the extracted features are completely independent of the speaker's language determines the final effect of the system. The feature extraction model in this embodiment can be implemented in the following ways. One implementation:
the feature extraction model is obtained by training a universal background model (UBM) on an unlabeled data set, and is used for acquiring the probability of the input voice segment on each Gaussian distribution in the UBM as the audio feature information.
specifically, in this way, unsupervised data can be used for training, which is relatively low in cost, but has a high requirement on the amount of training data. The input of the model can select the qualified voice segment to be subjected to audio feature extraction.
And in the training stage, a universal background model UBM is trained by using a label-free data set, and Gaussian distribution of the whole data set in a feature space is constructed. And an application stage, calculating the probability of each frame of the input speech segment on each Gaussian distribution in the UBM model, wherein the probability is used as the output of the feature extraction model, namely the audio feature information of the speech segment.
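A minimal sketch of the UBM variant (frame dimensions, component count, and the synthetic data are assumptions; a real system would use acoustic frames such as MFCCs):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
unlabeled_frames = rng.normal(size=(5000, 20))        # e.g. 20-dim frame features

# The UBM is a large GMM fit on the unlabeled corpus.
ubm = GaussianMixture(n_components=64, covariance_type="diag",
                      random_state=0).fit(unlabeled_frames)

def extract_features(segment_frames):
    """Per-frame posterior probability on each UBM Gaussian, used as the
    audio feature information (shape: n_frames x n_components)."""
    return ubm.predict_proba(segment_frames)

features = extract_features(rng.normal(size=(120, 20)))  # a 120-frame segment
```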
Another implementation:
the feature extraction model is obtained by training an autoencoder (AE) model on an unlabeled data set, and is used for acquiring the output of the input voice segment at the middle encoding layer of the AE model as the audio feature information.
Specifically, this approach likewise trains on unsupervised data, which is relatively cheap, but it requires a large amount of training data.
In the training stage, an autoencoder (AE) model is trained on the unlabeled data set. Optionally, the middle hidden layer (the middle encoding layer) of the autoencoder is designed as a compression bottleneck (encoding-layer dimension smaller than encoder input-layer dimension) to obtain sparse characteristics, so that the encoding layer extracts the key features representing the input voice segment. In the application stage, each frame of the input voice segment is passed through the autoencoder, and the output of the middle encoding layer is extracted as the output of the feature extraction model.
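A minimal PyTorch sketch of the bottleneck-autoencoder variant (layer sizes and the training loop are assumptions):

```python
import torch
import torch.nn as nn

class FrameAutoencoder(nn.Module):
    def __init__(self, in_dim=20, code_dim=8):          # code_dim < in_dim: compression
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 16), nn.ReLU(),
                                     nn.Linear(16, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 16), nn.ReLU(),
                                     nn.Linear(16, in_dim))

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code), code

model = FrameAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
frames = torch.randn(5000, 20)                          # unlabeled frame features

for _ in range(10):                                     # reconstruction training
    recon, _ = model(frames)
    loss = nn.functional.mse_loss(recon, frames)        # input is its own target
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():                                   # application stage
    _, audio_features = model(torch.randn(120, 20))     # encoding-layer output per frame
```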
In yet another implementation:
the feature extraction module is a model obtained by training a neural network model by using a labeled data set, and the feature extraction module is used for acquiring classification information of input voice segments at different depths as the audio feature information.
Specifically, in the method, supervised data are used for training, and the data are labeled, so that the cost is relatively high, but the data volume for training can be small.
A training phase for training one or more deep neural network models using the tagged data set such that the network has the ability to classify features of the input speech segments. Here, the classification capability includes a phoneme classification (which may be implemented by constructing an ASR system), an audio class classification (which may be implemented by constructing an audio event detection system), a speaker classification (which may be implemented by constructing an SRE or an ASV system), where the audio class includes, for example, animal voices, human voices, action sounds (such as clapping hands, etc.), and the speaker classification includes, for example, gender classification, age interval classification, etc., so that the neural network has a capability of extracting key attributes representing voices (the key attributes include content information, audio class information, and a classification corresponding to the speaker information), and the strength of the representing capability of the related attribute capability can be controlled by selecting hidden layer outputs at different depths in the deep neural network as outputs of the feature extraction model. In the application stage, the speech segment is input into the trained neural network model, and the output of the output layer or the hidden layer of the neural network model is extracted as the output of the feature extraction model.
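A minimal sketch of tapping hidden layers at different depths of a trained classifier as the feature output (the toy network and its dimensions are assumptions):

```python
import torch
import torch.nn as nn

# A toy frame classifier (e.g. over phoneme classes); hidden layers at different
# depths can serve as feature outputs. Dimensions here are illustrative.
classifier = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(),   # shallow layer: closer to raw acoustics
    nn.Linear(64, 32), nn.ReLU(),   # deeper layer: closer to class identity
    nn.Linear(32, 40),              # e.g. 40 phoneme logits
)

captured = {}
def save_output(name):
    def hook(module, inputs, output):
        captured[name] = output.detach()
    return hook

# Choose which depth to tap as the feature extractor's output.
classifier[3].register_forward_hook(save_output("depth2"))

classifier(torch.randn(120, 20))                 # run a 120-frame segment
audio_features = captured["depth2"]              # (120, 32) hidden-layer features
```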
In this embodiment, the audio features of the voice are extracted by the trained feature extraction model, and the extracted features are unrelated to the speaker's region, language, and so on; the player's control intention can therefore be identified accurately, improving the game's response speed and the user experience.
In an embodiment, the matching may be specifically performed as follows:
one implementation is as follows:
respectively matching the first audio characteristic information with the plurality of second audio characteristic information through a characteristic matching algorithm to obtain a plurality of characteristic matching scores;
and taking the second audio feature information with the highest score in the plurality of feature matching scores as the target audio feature information.
Specifically, a feature matching algorithm is used to match the first audio feature information against each of the plurality of second audio feature information, obtaining a plurality of feature matching scores. A feature matching score is, for example, the similarity between two pieces of audio feature information, expressed by, for example, Euclidean distance or cosine distance.
and selecting the second audio characteristic information with the highest score from the second audio characteristic information corresponding to the plurality of characteristic matching scores as the target audio characteristic information.
Another implementation:
respectively matching the first audio characteristic information with a plurality of second audio characteristic information through a characteristic matching algorithm to obtain a plurality of characteristic matching scores;
and taking the second audio feature information with the highest score and the score higher than a preset threshold value in the plurality of feature matching scores as the target audio feature information.
Specifically, a feature matching algorithm is used to match the first audio feature information against each of the plurality of second audio feature information, obtaining a plurality of feature matching scores. A feature matching score is, for example, the similarity between two pieces of audio feature information, expressed by, for example, Euclidean distance or cosine distance.
and selecting the second audio characteristic information with the highest score from the second audio characteristic information corresponding to the plurality of characteristic matching scores as the target audio characteristic information, so as to avoid misoperation and improve the accuracy, wherein the highest score is higher than a preset threshold, and if the highest score is lower than the preset threshold, the voice segment for game control is recorded again.
Optionally, the feature matching algorithm aligns the two unequal-length pieces of audio feature information to be matched, calculates the shortest distance between them according to the alignment result, and normalizes the result to obtain the similarity between the two pieces of audio feature information.
In one embodiment, the feature matching algorithm includes a Dynamic Time Warping (DTW) algorithm.
Specifically, the DTW algorithm can align the time sequences of two features of different lengths and obtain an optimal matching path. In one embodiment, to ensure the final matching score lies between 0 and 1, the cosine distance rather than the Euclidean distance is used when calculating the similarity between two features during DTW matching, and when post-processing the optimal matching path, the result is smoothed by the total number of matches along the path. To increase the speed of the DTW algorithm, a window is used to limit its search space during the search, exploiting the pronunciation characteristics of speech and the fact that the matching path of similar speech approximately follows the diagonal; this yields faster and more stable results.
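A minimal sketch of such a windowed DTW with cosine distance and path-length smoothing (the window size and the exact normalization are assumptions):

```python
import numpy as np

def dtw_match_score(a, b, window=20):
    """DTW between feature sequences a (n, d) and b (m, d): cosine distance per
    frame pair, a band of `window` frames around the diagonal to limit the
    search space, and smoothing by the matched path length so the final score
    lies in [0, 1]. Assumes abs(n - m) <= window so the end is reachable."""
    n, m = len(a), len(b)
    an = a / np.linalg.norm(a, axis=1, keepdims=True)
    bn = b / np.linalg.norm(b, axis=1, keepdims=True)
    cost = np.full((n + 1, m + 1), np.inf)
    steps = np.zeros((n + 1, m + 1), dtype=int)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - window), min(m, i + window) + 1):
            d = 1.0 - float(an[i - 1] @ bn[j - 1])      # cosine distance, in [0, 2]
            moves = ((i - 1, j - 1), (i - 1, j), (i, j - 1))
            prev = min(moves, key=lambda ij: cost[ij])
            if np.isinf(cost[prev]):
                continue
            cost[i, j] = cost[prev] + d
            steps[i, j] = steps[prev] + 1
    avg = cost[n, m] / max(steps[n, m], 1)              # smooth by path length
    return 1.0 - avg / 2.0                              # map into [0, 1]

# Example: score = dtw_match_score(np.random.rand(100, 32), np.random.rand(110, 32))
```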
In an embodiment, optionally, speaker feature information may be added to the audio feature information. The feature matching score corresponding to the speaker feature information is obtained as follows: to match the variable-length features output by the feature extraction model, a convolutional neural network structure further extracts a deeper feature representation, a pooling layer turns it into a feature vector of fixed dimension, and a PLDA (probabilistic linear discriminant analysis) algorithm calculates the log-likelihood score between the two final feature vectors to be matched as the feature matching score.
The speaker feature information can be used to verify whether the voice segment was uttered by the player corresponding to the user identification. This verification effectively prevents false triggering when, while the player is playing, other people in the environment utter speech similar to some control command.
In one embodiment, the speaker feature information within the audio feature information is generally separable from the feature information of the voice segment's content, and the speaker-related feature information can be masked if speaker verification is not required.
In one embodiment, the feature matching score obtained by the DTW algorithm and the feature matching score from speaker information matching can be combined as the final feature matching score.
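A minimal sketch of pooling variable-length features into a fixed speaker vector and fusing the two scores (mean pooling and cosine scoring are simplifying stand-ins for the CNN pooling and PLDA scoring described above, and the fusion weight is an assumption):

```python
import numpy as np

def speaker_vector(features):
    """Mean-pool variable-length frame features (n, d) into a fixed-dim vector.
    A stand-in for the CNN + pooling layer described above."""
    return features.mean(axis=0)

def fused_score(dtw_score, feats_a, feats_b, alpha=0.5):
    """Combine the DTW content-matching score with a speaker-similarity score."""
    va, vb = speaker_vector(feats_a), speaker_vector(feats_b)
    spk = float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
    return alpha * dtw_score + (1 - alpha) * spk
```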
In this embodiment, the audio feature information used for game control is matched against the audio feature information corresponding to the user by the feature matching algorithm, and the control command corresponding to the matched audio feature information is determined. The improved matching accuracy means the player's control intention can be identified accurately, improving the game's response speed and the player's experience.
To sum up, the method of the embodiments of the present application lets a player control the game quickly by voice during play, without having to find and click keys; it is not limited by the player's regional language or dialect, simplifies game operation, and improves the game's response speed and hence the interaction efficiency. It also supports player-defined voices for control commands, offering high flexibility and a better user experience.
Fig. 6 is a structural diagram of an embodiment of a control device of a game provided in the present application, and as shown in fig. 6, the control device of the game of the present embodiment includes:
an obtaining module 601, configured to obtain a voice segment and a user identifier for game control;
a feature extraction module 602, configured to input the voice segment into a feature extraction model for processing, and obtain first audio feature information of the voice segment;
a matching module 603, configured to match the first audio feature information with multiple second audio feature information, respectively, to obtain target audio feature information; the second audio characteristic information is pre-stored audio characteristic information corresponding to the user identification;
a processing module 604, configured to determine a target control command according to the target audio feature information and a corresponding relationship between the audio feature information and the control command;
and controlling execution of the corresponding operation in the game according to the target control command.
In a possible implementation manner, the correspondence includes a plurality of control commands corresponding to the user identifier and at least one piece of audio feature information corresponding to each control command.
In a possible implementation manner, the obtaining module 601 is specifically configured to:
when detecting that a preset awakening word appears in the recorded user voice, starting to intercept the voice, and when meeting a preset condition, stopping intercepting the voice to obtain the voice fragment;
wherein the preset conditions include: the intercepted duration reaching a first preset duration, or no effective voice being detected within a second preset duration.
In a possible implementation manner, the matching module 603 is specifically configured to:
respectively matching the first audio characteristic information with the plurality of second audio characteristic information through a characteristic matching algorithm to obtain a plurality of characteristic matching scores;
and taking the second audio feature information with the highest score in the plurality of feature matching scores as the target audio feature information.
In a possible implementation manner, the matching module 603 is specifically configured to:
respectively matching the first audio characteristic information with the plurality of second audio characteristic information through a characteristic matching algorithm to obtain a plurality of characteristic matching scores;
and taking the second audio feature information with the highest score and the score higher than a preset threshold value in the plurality of feature matching scores as the target audio feature information.
In a possible implementation manner, the processing module 604 is further configured to:
providing a graphical user interface through terminal equipment, wherein the graphical user interface at least comprises an audio control setting control;
responding to the operation of the user on the audio control setting control, acquiring a voice segment of the user, and performing audio control setting; wherein each operation of the user in the game can be controlled by one or more voice segments.
In a possible implementation manner, the processing module 604 is specifically configured to:
in response to user operation of the audio control setting control, displaying at least one operation of the game on the graphical user interface;
starting a recording function according to the operation of a user to obtain a voice fragment of the user;
inputting the voice segment into a feature extraction model for processing to obtain audio feature information of the voice segment;
and storing the audio characteristic information and the control command corresponding to the operation to obtain the corresponding relation.
In a possible implementation manner, the feature extraction module 602 is specifically configured to:
determining whether the voice fragment meets a preset voice qualified condition, wherein the voice qualified condition comprises that the effective voice time length is greater than a preset effective time length and/or the effective voice ratio is greater than a preset value;
and if the voice segment meets the voice qualification condition, inputting the voice segment into a feature extraction model for processing.
In a possible implementation manner, the feature extraction model is a model obtained by training a universal background model (UBM) with an unlabeled data set, and the feature extraction model is used for acquiring the probability of an input voice segment on each Gaussian distribution in the UBM as the audio feature information;
or,
the feature extraction model is a model obtained by training an autoencoder (AE) with an unlabeled data set, and the feature extraction model is used for acquiring the output of an input voice segment at the intermediate coding layer of the AE as the audio feature information;
or,
the feature extraction model is a model obtained by training a neural network model with a labeled data set, and the feature extraction model is used for acquiring classification information of an input voice segment at different depths as the audio feature information.
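By way of illustration only, the UBM variant can be approximated with scikit-learn's GaussianMixture as the universal background model; frame-level features such as MFCCs are assumed to be computed elsewhere, and averaging the posteriors over frames is an assumption of this sketch:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_ubm(unlabeled_frames: np.ndarray, n_components: int = 64) -> GaussianMixture:
        # The UBM is fit on an unlabeled data set of frame feature vectors,
        # shaped (n_frames, n_dims).
        ubm = GaussianMixture(n_components=n_components, covariance_type="diag")
        ubm.fit(unlabeled_frames)
        return ubm

    def extract_features(ubm: GaussianMixture, segment_frames: np.ndarray) -> np.ndarray:
        # The per-frame posterior probability on each Gaussian, averaged over
        # time, serves as the segment's audio feature information.
        posteriors = ubm.predict_proba(segment_frames)  # (n_frames, n_components)
        return posteriors.mean(axis=0)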
In one possible implementation, the feature matching algorithm includes a Dynamic Time Warping (DTW) algorithm.
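By way of illustration only, a textbook DTW distance between two frame sequences looks as follows; a lower distance means a closer match, so a matching score can be taken as, for example, the negative distance:

    import numpy as np

    def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
        # a and b are (n_frames, n_dims) feature sequences of possibly
        # different lengths; DTW aligns them with a monotonic warping path.
        n, m = len(a), len(b)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = np.linalg.norm(a[i - 1] - b[j - 1])   # local frame distance
                cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                     cost[i, j - 1],      # deletion
                                     cost[i - 1, j - 1])  # match
        return float(cost[n, m])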
The apparatus of this embodiment may be configured to implement the technical solutions of the above method embodiments, and the implementation principles and technical effects are similar, which are not described herein again.
Fig. 7 is a block diagram of an embodiment of an electronic device provided in the present application. As shown in Fig. 7, the electronic device includes:
a processor 701, and a memory 702 for storing executable instructions for the processor 701.
The above components may communicate over one or more buses.
The processor 701 is configured to execute the corresponding method in the foregoing method embodiments by executing the executable instructions; for the specific implementation process, reference may be made to the foregoing method embodiments, which are not described herein again.
The electronic device may be a terminal device or a server. Optionally, the electronic device may further include: a display 703 for displaying a graphical user interface.
The embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the method in the foregoing method embodiment is implemented.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (13)

1. A method of controlling a game, the method comprising:
acquiring a voice segment and a user identifier for game control;
inputting the voice segment into a feature extraction model for processing to obtain first audio feature information of the voice segment;
matching the first audio feature information with each of a plurality of second audio feature information to obtain target audio feature information; wherein the second audio feature information is pre-stored audio feature information corresponding to the user identifier;
determining a target control command according to the target audio feature information and a correspondence between audio feature information and control commands;
and controlling execution of a corresponding operation in the game according to the target control command.
2. The method of claim 1, wherein the correspondence comprises a plurality of control commands corresponding to the user identifier and at least one piece of audio feature information corresponding to each control command.
3. The method of claim 1, wherein the acquiring a voice segment for game control comprises:
upon detecting that a preset wake-up word appears in the recorded user voice, starting to intercept the voice, and stopping the interception when a preset condition is met, so as to obtain the voice segment;
wherein the preset condition includes: the interception duration reaches a first preset duration, or no valid voice is detected within a second preset duration.
4. The method according to any one of claims 1 to 3, wherein the matching the first audio feature information with each of a plurality of second audio feature information to obtain target audio feature information comprises:
matching the first audio feature information with each of the plurality of second audio feature information through a feature matching algorithm to obtain a plurality of feature matching scores;
and taking the second audio feature information with the highest of the plurality of feature matching scores as the target audio feature information.
5. The method according to any one of claims 1 to 3, wherein the matching the first audio feature information with each of a plurality of second audio feature information to obtain target audio feature information comprises:
matching the first audio feature information with each of the plurality of second audio feature information through a feature matching algorithm to obtain a plurality of feature matching scores;
and taking, as the target audio feature information, the second audio feature information whose feature matching score is both the highest and higher than a preset threshold.
6. The method according to any one of claims 1 to 3, further comprising:
providing a graphical user interface through a terminal device, wherein the graphical user interface includes at least an audio control setting control;
in response to the user's operation on the audio control setting control, acquiring a voice segment of the user and performing audio control setting; wherein each operation of the user in the game can be controlled by one or more voice segments.
7. The method of claim 6, wherein the acquiring a voice segment of the user and performing audio control setting in response to the user's operation on the audio control setting control comprises:
in response to the user's operation on the audio control setting control, displaying at least one operation of the game on the graphical user interface;
starting a recording function according to the user's operation to obtain a voice segment of the user;
inputting the voice segment into a feature extraction model for processing to obtain audio feature information of the voice segment;
and storing the audio feature information together with the control command corresponding to the operation to obtain the correspondence.
8. The method according to any one of claims 1-3 or 7, wherein before the inputting the voice segment into a feature extraction model for processing, the method further comprises:
determining whether the voice segment meets a preset voice qualification condition, wherein the voice qualification condition includes: the valid voice duration is greater than a preset valid duration, and/or the valid voice ratio is greater than a preset value;
and if the voice segment meets the voice qualification condition, inputting the voice segment into the feature extraction model for processing.
9. The method according to any one of claims 1-3 or 7, wherein
the feature extraction model is a model obtained by training a universal background model (UBM) with an unlabeled data set, and is used for acquiring the probability of an input voice segment on each Gaussian distribution in the UBM as the audio feature information;
or,
the feature extraction model is a model obtained by training an autoencoder (AE) with an unlabeled data set, and is used for acquiring the output of an input voice segment at the intermediate coding layer of the AE as the audio feature information;
or,
the feature extraction model is a model obtained by training a neural network model with a labeled data set, and is used for acquiring classification information of an input voice segment at different depths as the audio feature information.
10. The method of claim 4, wherein the feature matching algorithm comprises a Dynamic Time Warping (DTW) algorithm.
11. A control device for a game, comprising:
an acquisition module, configured to acquire a voice segment and a user identifier for game control;
a feature extraction module, configured to input the voice segment into a feature extraction model for processing to obtain first audio feature information of the voice segment;
a matching module, configured to match the first audio feature information with each of a plurality of second audio feature information to obtain target audio feature information; wherein the second audio feature information is pre-stored audio feature information corresponding to the user identifier;
a processing module, configured to determine a target control command according to the target audio feature information and a correspondence between audio feature information and control commands,
and to control execution of a corresponding operation in the game according to the target control command.
12. A computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, carries out the method of any one of claims 1 to 7.
13. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method of any of claims 1-7 via execution of the executable instructions.
CN202010741948.8A 2020-07-29 2020-07-29 Game control method, device, equipment and storage medium Pending CN111841007A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010741948.8A CN111841007A (en) 2020-07-29 2020-07-29 Game control method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010741948.8A CN111841007A (en) 2020-07-29 2020-07-29 Game control method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111841007A true CN111841007A (en) 2020-10-30

Family

ID=72947981

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010741948.8A Pending CN111841007A (en) 2020-07-29 2020-07-29 Game control method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111841007A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106938141A (en) * 2017-04-25 2017-07-11 合肥泽诺信息科技有限公司 A kind of internal home network games system based on speech recognition and somatosensory recognition
KR20190074011A (en) * 2017-12-19 2019-06-27 삼성전자주식회사 Method and device for voice recognition
CN109724215A (en) * 2018-06-27 2019-05-07 平安科技(深圳)有限公司 Air conditioning control method, air conditioning control device, air-conditioning equipment and storage medium
CN210052520U (en) * 2018-12-20 2020-02-11 深圳市朗强科技有限公司 Infrared transmitting device and electric appliance control system
CN109783049A (en) * 2019-02-15 2019-05-21 广州视源电子科技股份有限公司 Method of controlling operation thereof, device, equipment and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112632329A (en) * 2020-12-18 2021-04-09 咪咕互动娱乐有限公司 Video extraction method and device, electronic equipment and storage medium
CN112717377A (en) * 2020-12-31 2021-04-30 贵阳动视云科技有限公司 Cloud game auxiliary control method and device
WO2023065854A1 (en) * 2021-10-22 2023-04-27 华为技术有限公司 Distributed speech control method and electronic device

Similar Documents

Publication Publication Date Title
CN111933129B (en) Audio processing method, language model training method and device and computer equipment
US11062699B2 (en) Speech recognition with trained GMM-HMM and LSTM models
US10311863B2 (en) Classifying segments of speech based on acoustic features and context
CN102568478B (en) Video play control method and system based on voice recognition
CN111833853B (en) Voice processing method and device, electronic equipment and computer readable storage medium
US8719019B2 (en) Speaker identification
CN108428446A (en) Audio recognition method and device
CN111841007A (en) Game control method, device, equipment and storage medium
CN109741734B (en) Voice evaluation method and device and readable medium
CN112309365A (en) Training method and device of speech synthesis model, storage medium and electronic equipment
CN112837401B (en) Information processing method, device, computer equipment and storage medium
CN112487139A (en) Text-based automatic question setting method and device and computer equipment
US20190164555A1 (en) Apparatus, method, and non-transitory computer readable storage medium thereof for generatiing control instructions based on text
CN112735371B (en) Method and device for generating speaker video based on text information
US11263852B2 (en) Method, electronic device, and computer readable storage medium for creating a vote
CN103943111A (en) Method and device for identity recognition
CN111179915A (en) Age identification method and device based on voice
CN110223678A (en) Audio recognition method and system
CN110580897B (en) Audio verification method and device, storage medium and electronic equipment
Yu et al. {SMACK}: Semantically Meaningful Adversarial Audio Attack
CN114708869A (en) Voice interaction method and device and electric appliance
CN112562723B (en) Pronunciation accuracy determination method and device, storage medium and electronic equipment
Toyama et al. Use of Global and Acoustic Features Associated with Contextual Factors to Adapt Language Models for Spontaneous Speech Recognition.
CN110853669B (en) Audio identification method, device and equipment
JP2003210833A (en) Conversation game system, conversation game method and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination