CN117953895A - Interactive control method, device, equipment and storage medium based on voice data - Google Patents


Info

Publication number
CN117953895A
Authority
CN
China
Prior art keywords
semantic
data
control instruction
vocabulary
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211352352.4A
Other languages
Chinese (zh)
Inventor
汪洋 (Wang Yang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202211352352.4A
Publication of CN117953895A

Abstract

The application provides an interactive control method, apparatus, device, and storage medium based on voice data. The method comprises the following steps: acquiring text data to be used as training samples, where the text data comprises a plurality of control instruction texts; performing semantic structuring on the text data to obtain semantic structure data for each control instruction; acquiring a weight value corresponding to each control instruction; labeling the semantic structure data of each control instruction based on its weight value to obtain weighted semantic data; and training a semantic understanding model based on the weighted semantic data, where the trained semantic understanding model is used to convert voice data into text data and to identify the control instruction corresponding to the text data. The application can improve the accuracy of interactive control of a terminal device through voice data.

Description

Interactive control method, device, equipment and storage medium based on voice data
Technical Field
The present application relates to artificial intelligence technologies, and in particular to an interactive control method, apparatus, device, and storage medium based on voice data.
Background
Artificial Intelligence (AI) is the theory, method, technique, and application system that uses digital computers, or machines controlled by digital computers, to simulate and extend human intelligence: to perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. The key technologies of speech technology are automatic speech recognition, speech synthesis, and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of human-computer interaction, and voice is set to become one of the principal modes of human-computer interaction in the future.
In the related art, the keywords of a control instruction are recognized from the physical spectrum features of the sound, the control instruction corresponding to the sound is determined, and the terminal device executes that control instruction to realize interactive control of the terminal device. This approach relies on offline voice keyword recognition, whose keyword matching rate is not high, which affects the accuracy of voice interaction control.
In the related art, there is no satisfactory technical solution for improving the accuracy of interactive control of a terminal device through voice data.
Disclosure of Invention
Embodiments of the present application provide an interactive control method and apparatus based on voice data, an electronic device, a computer-readable storage medium, and a computer program product, which can improve the accuracy of interactive control of a terminal device through voice data.
The technical scheme of the embodiment of the application is realized as follows:
The embodiment of the application provides an interactive control method based on voice data, which comprises the following steps:
Acquiring text data to be used as training samples, wherein the text data comprises a plurality of control instruction texts;
Carrying out semantic structuring processing on the text data to obtain semantic structure data of each control instruction;
Acquiring a weight value corresponding to each control instruction;
Labeling the semantic structure data of each control instruction based on the weight value of each control instruction to obtain weighted semantic data;
Training the semantic understanding model based on the weighted semantic data, wherein the trained semantic understanding model is used for converting the voice data into text data and identifying control instructions corresponding to the text data.
The embodiment of the application provides an interactive control method based on voice data, which comprises the following steps:
Displaying a virtual scene in a human-computer interaction interface;
Acquiring voice data;
Invoking a semantic understanding model based on the voice data to carry out semantic recognition processing, and determining a control instruction corresponding to the voice data, wherein the semantic understanding model is obtained through training by the above interactive control method based on voice data;
Executing the control instruction.
The embodiment of the application provides an interaction control device based on voice data, which comprises:
A sample acquisition module, configured to acquire text data to be used as training samples, wherein the text data comprises a plurality of control instruction texts;
A sample processing module, configured to perform semantic structuring processing on the text data to obtain semantic structure data of each control instruction;
The sample processing module, further configured to acquire a weight value corresponding to each control instruction;
The sample processing module, further configured to label the semantic structure data of each control instruction based on the weight value of each control instruction to obtain weighted semantic data;
A model training module, configured to train the semantic understanding model based on the weighted semantic data, wherein the trained semantic understanding model is used for converting the voice data into text data and identifying control instructions corresponding to the text data.
The embodiment of the application provides an interaction control device based on voice data, which comprises:
A display module, configured to display the virtual scene in the human-computer interaction interface;
A voice acquisition module, configured to acquire voice data;
A recognition module, configured to invoke a semantic understanding model based on the voice data to carry out semantic recognition processing and determine a control instruction corresponding to the voice data, wherein the semantic understanding model is obtained through training by the interactive control method based on voice data in the embodiment of the present application;
The display module, further configured to execute the control instruction.
An embodiment of the present application provides an electronic device, including:
A memory for storing computer executable instructions;
And the processor is used for realizing the interactive control method based on the voice data when executing the computer executable instructions stored in the memory.
The embodiment of the application provides a computer readable storage medium, which stores computer executable instructions for causing a processor to execute, thereby realizing the interactive control method based on voice data.
The embodiment of the application provides a computer program product, which comprises a computer program or a computer executable instruction, wherein the computer program or the computer executable instruction realizes the interactive control method based on voice data provided by the embodiment of the application when being executed by a processor.
The embodiment of the application has the following beneficial effects:
The text of each control instruction is converted into semantic structure data, the semantic structure data is labeled with a weight value, and weighted semantic data is generated based on the semantic structure data labeled with the weight value. Structuring the text data improves the accuracy of training-sample labeling, which improves the accuracy of training the semantic understanding model, further improves the accuracy with which the semantic understanding model recognizes semantics and control instructions during interactive control, and saves the computing resources required in the interactive control process.
Drawings
FIG. 1 is a schematic diagram of an application mode of an interactive control method based on voice data according to an embodiment of the present application;
FIG. 2A is a schematic structural diagram of a server according to an embodiment of the present application;
FIG. 2B is a schematic structural diagram of a semantic understanding model provided by an embodiment of the present application;
FIGS. 3A to 3G are schematic flow diagrams of an interactive control method based on voice data according to an embodiment of the present application;
FIGS. 4A to 4C are schematic diagrams of a human-computer interaction interface of a terminal device according to an embodiment of the present application;
FIG. 4D is a schematic flow diagram of an interactive control method based on voice data according to an embodiment of the present application;
FIG. 5 is a flowchart of an interactive control method based on voice data according to an embodiment of the present application;
FIG. 6A is a flowchart of an interactive control method based on voice data according to an embodiment of the present application;
FIG. 6B is a schematic diagram of an application scenario provided by an embodiment of the present application;
FIG. 6C is a flowchart of an interactive control method based on voice data according to an embodiment of the present application;
FIG. 6D is a schematic diagram of training a speech recognition model in an embodiment of the present application;
FIGS. 6E and 6F are schematic flow diagrams of an interactive control method based on voice data according to an embodiment of the present application;
FIG. 7A is a schematic diagram of a data structure of an ASR speech recognition engine according to an embodiment of the present application;
FIG. 7B is a schematic diagram of a data structure of a semantic understanding model according to an embodiment of the present application;
FIG. 7C is a table comparing the effects of embodiments of the present application.
Detailed Description
To make the objects, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present application, and all other embodiments obtained by those skilled in the art without inventive effort fall within the scope of protection of the present application.
In the following description, "some embodiments" describes a subset of all possible embodiments; it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments, and the embodiments can be combined with one another where no conflict arises.
In the following description, the terms "first", "second", "third", and the like merely distinguish similar objects and do not imply a specific ordering of the objects. It is understood that, where permitted, "first", "second", and "third" may be interchanged in a specific order or sequence, so that the embodiments of the application described herein can be practiced in orders other than those illustrated or described herein.
It should be noted that the embodiments of the present application involve related data such as user information and user feedback data. When the embodiments of the present application are applied to specific products or technologies, user permission or consent must be obtained, and the collection, use, and processing of the related data must comply with the relevant laws, regulations, and standards of the relevant countries and regions.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing the embodiments only and is not intended to be limiting of the application.
Before describing embodiments of the present application in further detail, the terms and terminology involved in the embodiments of the present application will be described, and the terms and terminology involved in the embodiments of the present application will be used in the following explanation.
1) Automatic Speech Recognition (ASR): a technique for converting human speech into text. The goal is to convert the lexical content of human language into computer-readable input, such as keys, binary codes, or character sequences.
2) Semantics: the meaning of a word, a symbol, an action, etc. Semantic understanding is the parsing of text into structured, machine-readable intent and word slot (Slot) information through a series of AI algorithms.
3) Natural Language Understanding (NLU): a general term for all method models or tasks that enable a machine to understand text content.
4) Classification model: a model that, from the classification basis and the specific classes in sample data, predicts which class a subsequently given object belongs to. For example: based on the samples, predicting the probabilities that a sample belongs to different classes.
The embodiment of the application provides an interaction control method based on voice data, an interaction control device based on voice data, electronic equipment, a computer readable storage medium and a computer program product, which can improve the accuracy of interaction control on terminal equipment through the voice data.
The following describes exemplary applications of the electronic device provided by the embodiments of the present application. The electronic device may be implemented as various types of user terminals, such as a notebook computer, a tablet computer, a desktop computer, a set-top box, a mobile device (for example, a mobile phone, a portable music player, a personal digital assistant, a dedicated messaging device, or a portable game device), or a vehicle-mounted terminal, and may also be implemented as a server. An exemplary application in which the electronic device is implemented as a server is described below.
Referring to fig. 1, fig. 1 is a schematic diagram of an application mode of an interactive control method based on voice data according to an embodiment of the present application. By way of example, fig. 1 involves a training server 200-1, a speech recognition server 200-2, a network 300, and a terminal device 400. The training server 200-1 communicates with the speech recognition server 200-2 via the network 300 or by other means, and the terminal device 400 is connected to the speech recognition server 200-2 via the network 300, where the network 300 may be a wide area network, a local area network, or a combination of the two.
In some embodiments, a game application runs on the terminal device 400, such as a card game; the speech recognition server 200-2 is a server of the game platform and runs a speech recognition service; the training server 200-1 generates corresponding training samples based on the text of the control instructions in the game and trains the semantic understanding model; the user may be a player. The description below continues with this example.
By way of example, the training server 200-1 obtains a semantic understanding model by training according to the interactive control method based on voice data of the embodiment of the present application and synchronizes the semantic understanding model to the speech recognition server 200-2. When the terminal device 400 receives speech uttered by the user, it converts the sound signal into voice data and sends the voice data to the speech recognition server 200-2. The speech recognition server 200-2 invokes the semantic understanding model to determine, based on the voice data, the semantics and the control instruction corresponding to the speech, and sends the control instruction and the semantics to the terminal device 400. The terminal device 400 executes the corresponding control instruction and displays the corresponding game picture, thereby improving the efficiency of voice interactive control.
In some embodiments, the interactive control method based on voice data according to the embodiments of the present application may also be applied in the following application scenarios:
(1) Automatic driving: the user speaks a control instruction related to automatic driving to the terminal device; the terminal device recognizes the user's speech, converts the sound signal into text data, invokes the semantic understanding model trained by the interactive control method based on voice data of the embodiment of the present application to determine the control instruction based on the text data, and controls the vehicle to execute the corresponding control instruction.
(2) Application control: for example, the application is online conference software. The user can speak control instructions such as "enter the conference" or "exit the conference" to the terminal device; the terminal device recognizes the user's speech, converts the sound signal into text data, invokes the semantic understanding model trained by the interactive control method based on voice data to determine the control instruction, and controls the online conference software to enter or exit the conference accordingly.
(3) Game interaction control: for example, the game is a card game. The user speaks the control instruction "play a card" to the terminal device; the terminal device recognizes the user's speech, converts the sound signal into text data, invokes the semantic understanding model trained by the interactive control method based on voice data to determine that the control instruction is playing a card, controls the card game to play automatically, and displays the result in the human-computer interaction interface of the terminal device.
For another example: a client (such as a game application) runs on the terminal device and outputs a virtual scene including role playing during operation, where the virtual scene may be an environment in which game characters interact, such as a plain, a street, or a valley in which the game characters fight; the first virtual object may be a game character under the control of a user, i.e., controlled by a real user. The user speaks control instructions such as "move left" or "jump" to the terminal device; the terminal device recognizes the user's speech, converts the sound signal into text data, invokes the semantic understanding model trained by the interactive control method based on voice data of the embodiment of the present application to determine the control instruction, controls the virtual object to perform the corresponding action, and displays the movement process of the virtual object in the human-computer interaction interface of the terminal device.
The embodiments of the present application can be realized by blockchain technology: the semantic understanding model trained by the embodiments of the present application can be uploaded to a blockchain for storage, and the reliability of the model is ensured by a consensus algorithm. A blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. The blockchain, essentially a decentralized database, is a chain of data blocks generated in association using cryptographic methods; each block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product services layer, and an application services layer.
The embodiments of the present application can be realized through database technology. A database, in short, can be regarded as an electronic file cabinet, a place for storing electronic files, on which a user can perform operations such as adding, querying, updating, and deleting data. A "database" is a collection of data stored together in a manner that can be shared with multiple users, with as little redundancy as possible, independent of any application.
A database management system (DBMS) is a computer software system designed for managing databases, and generally has basic functions such as storage, retrieval, security, and backup. Database management systems may be classified according to the database models they support, such as relational or XML (Extensible Markup Language); by the computer types supported, such as server clusters or mobile phones; by the query language used, such as Structured Query Language (SQL) or XQuery; by performance emphasis, such as maximum scale or maximum operating speed; or by other classification schemes. Regardless of the classification used, some DBMSs are able to span categories, for example, supporting multiple query languages simultaneously.
The embodiments of the present application can also be realized through cloud technology. Cloud technology is a general term for network technology, information technology, integration technology, management platform technology, application technology, and the like applied under the cloud computing business model; it can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support. Background services of technical network systems require large amounts of computing and storage resources, for example video websites, picture websites, and portal sites. With the advanced development of the internet industry and the growth of demands such as search services, social networks, mobile commerce, and open collaboration, each article may have its own hash-code identification mark, which needs to be transmitted to the background system for logic processing; data of different levels are processed separately, and all kinds of industry data require strong system backing support, which can only be realized through cloud computing.
In some embodiments, training server 200-1 and speech recognition server 200-2 may be integrated into a single physical server.
In some embodiments, the training server 200-1 or the speech recognition server 200-2 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, big data, and artificial intelligence platforms. The terminal device may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, or the like. The terminal device and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in the embodiments of the present application.
Referring to fig. 2A, fig. 2A is a schematic structural diagram of a server according to an embodiment of the present application. The training server 200-1 shown in fig. 2A includes: at least one processor 410, a memory 450, and at least one network interface 420. The various components in the training server 200-1 are coupled together by a bus system 440. It will be appreciated that the bus system 440 is used to realize connection and communication among these components. In addition to the data bus, the bus system 440 includes a power bus, a control bus, and a status signal bus. For clarity of illustration, however, the various buses are all labeled as bus system 440 in fig. 2A.
The processor 410 may be an integrated circuit chip having signal processing capabilities, such as a general-purpose processor, a digital signal processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components; the general-purpose processor may be a microprocessor or any conventional processor.
Memory 450 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard drives, optical drives, and the like. Memory 450 optionally includes one or more storage devices that are physically located remote from processor 410.
Memory 450 includes volatile memory or non-volatile memory, and may include both. The non-volatile memory may be a read-only memory (ROM), and the volatile memory may be a random access memory (RAM). The memory 450 described in the embodiments of the present application is intended to comprise any suitable type of memory.
In some embodiments, memory 450 is capable of storing data to support various operations, examples of which include programs, modules and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 451 including system programs, e.g., framework layer, core library layer, driver layer, etc., for handling various basic system services and performing hardware-related tasks, for implementing various basic services and handling hardware-based tasks;
A network communication module 452 for reaching other electronic devices via one or more (wired or wireless) network interfaces 420; exemplary network interfaces 420 include: Bluetooth, wireless fidelity (WiFi), universal serial bus (USB), and the like;
In some embodiments, the interactive control apparatus based on voice data provided in the embodiments of the present application may be implemented in software. Fig. 2A shows the interactive control apparatus 455 based on voice data stored in the memory 450, which may be software in the form of a program or a plug-in, and includes the following software modules: a sample acquisition module 4551, a sample processing module 4552, and a model training module 4553. These modules are logical, and may therefore be arbitrarily combined or further split depending on the functions implemented. The functions of the respective modules are described below.
The interactive control method based on voice data provided by the embodiment of the application will be described in connection with the exemplary application and implementation of the server provided by the embodiment of the application.
Referring to fig. 3A, fig. 3A is a schematic flow chart of an interactive control method based on voice data according to an embodiment of the present application, and will be described with reference to the steps shown in fig. 3A.
In step 301, text data for use as training samples is acquired.
The text data comprises a plurality of control instruction texts. A plurality of control instruction texts in the domain corresponding to the service requirement (such as games, autopilot, smart home, or applications) may be acquired according to the service requirement of the interactive control, and the acquired text data is used as training samples. For example: the service of the interactive control is a card game, and the control instructions include playing a card, robbing the landlord, and the like. For another example: the service of the interactive control is automatic driving, and the control instructions include turning left, pulling over, accelerating, and the like. The embodiments of the present application are explained below by taking a card game as an example.
In step 302, semantic structuring is performed on the text data to obtain semantic structure data for each control instruction.
By way of example, the semantics are meaning of language vocabulary, and the semantic structuring process is to perform structural conversion process on text data according to the semantics of the text data. In some embodiments, referring to fig. 3B, a flow chart of an interactive control method based on voice data according to an embodiment of the present application is shown; step 302 may be implemented by the following steps 3021 to 3022, which are described in detail below.
In step 3021, the following processing is performed for each control instruction text: and acquiring the vocabulary attribute of each vocabulary in the control instruction text.
By way of example, vocabulary attributes include: entity words and non-entity words. An entity word may be a character name, an object name, a verb, etc., and a non-entity word may be a modal word. For example: in the control instruction "I want to play the six of hearts", "I" is a role, "want" is a non-entity word, "play" is a verb (an action), "hearts" is the suit of the card, and "six" is the rank (or size) of the card. For another example: in "I want to rob the landlord", "want" is a non-entity word, "I" is a role, "rob" is an action, and "landlord" is a role.
In step 3022, each entity word in the control instruction text is replaced with its corresponding vocabulary attribute tag, and each non-entity word in the control instruction text is retained, to obtain the semantic structure data of each control instruction.
By way of example, a vocabulary attribute tag is a tag that uses the attribute of a word in place of the word itself. Taking a card game as an example, the attributes of the entity words in a control instruction include: rank, role, action, suit, shape, size, and the like. Continuing with the control instruction text exemplified above, replacing each entity word in "I want to play the six of hearts" with its vocabulary attribute tag yields the semantic structure data "$role1 want $action $suit $size".
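By way of illustration only, the following Python sketch mirrors steps 3021 and 3022 under simplifying assumptions: the instruction text is already tokenized, and the attribute dictionary, tag names, and the semantic_structure helper are hypothetical rather than taken from the patent.

    # Hypothetical vocabulary-attribute dictionary mapping entity words to tags.
    WORD_ATTRS = {
        "I": "$role1",
        "play": "$action",
        "hearts": "$suit",
        "spades": "$suit",
        "six": "$size",
        "three": "$size",
    }

    def semantic_structure(tokens):
        # Replace each entity word with its vocabulary attribute tag;
        # keep non-entity words (e.g., "want") unchanged.
        return [WORD_ATTRS.get(token, token) for token in tokens]

    # "I want play hearts six" -> ['$role1', 'want', '$action', '$suit', '$size']
    print(semantic_structure(["I", "want", "play", "hearts", "six"]))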
Converting the text data into semantic structure data makes the text data convenient for computer devices to process, improves the efficiency of training the model, further improves the accuracy of the trained model, and improves the interaction efficiency of interactive control.
With continued reference to fig. 3A, in step 303, a weight value corresponding to each control instruction is obtained.
In some embodiments, referring to fig. 3C, fig. 3C is a flow chart of an interactive control method based on voice data according to an embodiment of the present application; step 303 may be implemented by the following steps 3031 to 3033, which are described in detail below.
In step 3031, the following processing is performed for each control instruction text: and acquiring the vocabulary quantity in the control instruction text.
Continuing with the card game exemplified above: the control instruction "I want to play the six of hearts" includes five words (I, want, play, hearts, six); the control instruction "play" is a single word, which is used to control the game client to play a card automatically.
In step 3032, when the number of words is 1, the occurrence probability of words corresponding to the control instruction text in the text data is obtained, and the occurrence probability is used as a weight value corresponding to the control instruction.
For example, the text data is the text data used as training samples in step 301. When the number of words in the control instruction text is 1, the probability that the word corresponding to the control instruction text appears in all the text data used as training samples is obtained. The occurrence probability is the ratio of the following parameters: the frequency of occurrence of the word in all the text data, and the total number of words in the text data.
In step 3033, when the number of words is greater than 1, each word in the control instruction text is combined into a word sequence, probability prediction processing is performed on the word sequence, word sequence probability is obtained, and the word sequence probability is used as a weight value of the control instruction.
By way of example, when the number of words is greater than 1, taking "I want to play the six of hearts" as an example, the words are combined into the word sequence [I, want, play, hearts, six], and the probability corresponding to this word sequence can be predicted by an N-gram language model. The principle of N-gram language model prediction is as follows: the first N-1 words in the word sequence are taken as history to predict the occurrence probability of the N-th word. The probability of the word sequence [I, want, play, hearts, six] can then be characterized as the product of the occurrence probabilities of the words in the sequence.
By way of example, steps 3031 through 3033 may be characterized by the following formula (1):

score(x) = uni_count(x) / SUM_CNT, if gram_count(x) = 1; score(x) = ngram_prob(x), if gram_count(x) > 1 (1)

where uni is a single word in the dictionary and uni_count(x) is the statistical word frequency of the single word; SUM_CNT is the sum of the word frequencies of all words; ngram_prob is the prediction probability score of the N-gram language model; and gram_count counts the number of words included in the control instruction.
For example: 1. the control instruction text is "play" and includes a word. The word frequency of the card (i.e., the frequency of occurrence in the text data as a training sample) is 60 and the total vocabulary number of the text data is 10000, and when the frequency of occurrence of the text of the control instruction is taken as a score, score ('card output')=60/10000=0.006. 2. The control instruction text is "red peach six" composed of two words, and gram count is equal to 2. The probability was calculated using an N-gram model, N-gram_prob ('red peach', 'six') =0.018.
With continued reference to fig. 3A, in step 304, the semantic structure data of each control instruction is labeled based on the weight value of each control instruction, to obtain weighted semantic data.
In some embodiments, more weighted semantic data may be generated for training based on semantic structure data of existing training samples as templates. Referring to fig. 3D, fig. 3D is a flow chart of an interactive control method based on voice data according to an embodiment of the present application; step 304 may be implemented by the following steps 3041 to 3045, which are described in detail below.
In step 3041, a plurality of entity words associated with each vocabulary attribute tag and the occurrence frequency corresponding to each entity word are obtained, and the occurrence frequency corresponding to each entity word is used as a weight value of the entity word.
The manner in which the frequency of occurrence is obtained has been described above in step 3032, for example.
In step 3042, the following processing is performed on the semantic structure data of each control instruction: and taking the semantic structure data of the control instruction as a semantic template.
By way of example: the semantic structure data "$role1 want $action $suit $size" corresponding to "I want to play the six of hearts" is used as a semantic template (PAT), and each vocabulary attribute tag included in the semantic template (PAT) can be used as a keyword (Slot).
In step 3043, the plurality of entity words associated with the vocabulary attribute tags included in the semantic template are combined to obtain multiple pieces of new semantic structure data.
In some embodiments, step 3043 may be implemented as follows: a plurality of combining processes are performed based on the semantic template to obtain a plurality of different word sequences, where a combining process includes: extracting one target entity word from the plurality of entity words associated with each vocabulary attribute tag, and combining the target entity words into a word sequence in the order of the vocabulary attribute tags in the semantic template; the non-entity words in the semantic template are then combined with each word sequence to obtain the multiple pieces of new semantic structure data.
For example, assume word sequence A = {word a, word b} and word sequence B = {word 0, word 1, word 2}; combining every word of the two sequences yields the word-sequence set {(a, 0), (a, 1), (a, 2), (b, 0), (b, 1), (b, 2)}. Assuming the non-entity word in the semantic template is "want", combinations based on "I want to play the six of hearts" can yield instructions corresponding to new semantic structure data such as "I want to play the three of spades" and "I want to play the two of diamonds".
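A sketch of this weighted combination, assuming each slot value is stored as a (words, weight) pair; the function name follows the decare notation of formula (2.1) below, and the weights are illustrative:

    def decare(seq_a, seq_b):
        # Weighted Cartesian product of two slot-value lists: pair every value
        # of seq_a with every value of seq_b and multiply their weights.
        return [(a_words + b_words, a_w * b_w)
                for (a_words, a_w) in seq_a
                for (b_words, b_w) in seq_b]

    suits = [(("hearts",), 0.5), (("spades",), 0.3), (("diamonds",), 0.2)]
    sizes = [(("three",), 0.2), (("six",), 0.1), (("two",), 0.05)]
    candidates = decare(suits, sizes)  # 3 * 3 = 9 weighted combinations
    print(candidates[0])               # (('hearts', 'three'), 0.1)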
In some embodiments, before the non-entity words in the semantic template are combined with each word sequence to obtain the multiple pieces of new semantic structure data, the new semantic structure data may be pruned by the following method (the Beam algorithm): for each word sequence, the weight values of the entity words included in the word sequence are multiplied in order to obtain the word-sequence probability of that word sequence; the word sequences are then sorted in descending order of word-sequence probability, and at least one word sequence at the head of the sorted result is retained and used to generate the new semantic structure data.
By way of example, step 3043 may be characterized by the following formulas (2.1), (2.2), and (2.3):

decare(A, B) = {(x_i, y_i, wa_i * wb_i) | x_i ∈ A, y_i ∈ B, wa_i ∈ A, wb_i ∈ B} (2.1)

beam({c_1, c_2, ..., c_n}) = {c_i | 1 ≤ i ≤ n, c_i ≥ beam_threshold} (2.2)

semantic_gen({seq}) = {beam(decare(seq_i, seq_{i+1})) | seq_i ∈ seq, 1 ≤ i ≤ n-1} (2.3)
Wherein decare (a, B) characterizes weighted expansion of adjacent vocabulary sequences a and B using a cartesian product algorithm. Cartesian product algorithm: assuming that set a= { a, B }, set b= {0,1,2}, the cartesian product of the two sets is { (a, 0), (a, 1), (a, 2), (B, 0), (B, 1), (B, 2) }.
In formulas (2.2) and (2.3), semantic_gen({seq}) characterizes the use of the Beam algorithm to dynamically sort and clip the number of expansions, generating the weighted semantic data. The Beam algorithm is a pruning algorithm used to obtain the probabilities corresponding to the word sequences and to retain the several word sequences with the highest probability. For example: there are 3*3 = 9 candidates in total; if the Beam algorithm retains 6, then the 6 of these 9 candidates with the highest probability are retained.
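A matching sketch of the pruning in formula (2.2): candidates below the threshold are dropped, the rest are sorted by weight in descending order, and at most beam_size of them are kept (the default values here are illustrative):

    def beam(candidates, beam_size=6, threshold=0.0):
        # candidates: list of (word_sequence, weight) pairs.
        kept = [c for c in candidates if c[1] >= threshold]
        kept.sort(key=lambda c: c[1], reverse=True)
        return kept[:beam_size]

    # Continuing the decare sketch above: 9 candidates in, the 6 with the
    # highest probability kept.
    # pruned = beam(candidates, beam_size=6)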
In step 3044, a weight value for each piece of new semantic structure data is determined based on the weight value for the entity word included in each piece of new semantic structure data.
By way of example, the weight value of each piece of new semantic structure data may be the product of the weight values of the entity words included in that piece of new semantic structure data.
In step 3045, labeling processing is performed on each piece of new semantic structure data, so as to obtain weighted semantic data.
For example, the weight value is labeled onto the corresponding piece of new semantic structure data to obtain the weighted semantic data, for example: "I want to play the one of hearts": 0.0225.
In the embodiment of the present application, a large amount of new semantic data is generated using the existing semantic structure data as templates, which saves the computing resources required for labeling training samples, improves the efficiency of training the semantic understanding model, and further improves the accuracy of the trained model.
With continued reference to FIG. 3A, in step 305, a semantic understanding model is trained based on weighted semantic data.
By way of example, the semantic understanding model may be a combination of different models, or may be end-to-end. The trained semantic understanding model is used for converting voice data into text data and identifying the control instruction corresponding to the text data.
In some embodiments, referring to fig. 2B, fig. 2B is a schematic structural diagram of a semantic understanding model provided by an embodiment of the present application; the semantic understanding model 201C includes a speech recognition model 202C and a domain classification model 203C. Referring to fig. 3E, fig. 3E is a flow chart of an interactive control method based on voice data according to an embodiment of the present application; step 305 may be implemented through steps 3051E to 3053E, as described in detail below.
In step 3051E, the weighted semantic data is normalized to obtain normalized weighted semantic data.
In step 3052E, a speech recognition model is trained based on the normalized weighted semantic data, and a domain classification model is trained based on the normalized weighted semantic data.
By way of example, the speech recognition model is a model for converting voice data into text data and semantics, and the domain classification model is a model for predicting the control instruction corresponding to the semantics; different training tasks are performed for the two models respectively.
By way of example, step 3052E may be implemented as follows: based on the normalized weighted semantic data, the speech recognition model is invoked to execute the training task of predicting the semantics corresponding to a control instruction text; based on the normalized weighted semantic data, the domain classification model is invoked to execute the training task of predicting the control instruction corresponding to semantic data.
For example, the normalized weighted semantic data is used as supervision information to update the parameters of the speech recognition model, so that the speech recognition model acquires the functions of converting voice data into text data and converting text data into the corresponding semantic data.
For example, the domain classification model is trained synchronously: the normalized weighted semantic data is input into the domain classification model, which performs the following processing: each piece of normalized weighted semantic data is used as a matching template; for any piece of semantic data, fuzzy character matching is performed against each piece of normalized weighted semantic data, and the control instruction corresponding to the matched piece of normalized weighted semantic data is obtained; the cross-entropy loss of the domain classification model is determined based on the difference between the matched control instruction and the control instruction actually corresponding to the semantic data, and the parameters of the domain classification model are updated based on the cross-entropy loss.
In step 3053E, the trained speech recognition model is combined with the trained domain classification model to obtain a trained semantic understanding model.
For example, the output of the trained speech recognition model is used as the input of the trained domain classification model, and the two are combined.
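A minimal sketch of this combination, treating the two trained models as opaque callables (the class and attribute names are illustrative):

    class CombinedSemanticUnderstandingModel:
        # Chain the trained models: the speech recognition model maps voice
        # data to text/semantics, and the domain classification model maps
        # the semantics to a control instruction.
        def __init__(self, speech_model, domain_model):
            self.speech_model = speech_model
            self.domain_model = domain_model

        def __call__(self, voice_data):
            semantics = self.speech_model(voice_data)
            return self.domain_model(semantics)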
In some embodiments, the trained speech recognition model may be applied in an automatic speech recognition engine, the normalized weighted semantic data may be stored as a bin-format file, and the domain classification model and normalized weighted semantic data may be applied in a semantic understanding algorithm module. And taking the output of the automatic speech recognition engine as the input of the semantic understanding algorithm module to realize the combination of the models.
By way of example, realizing interactive control based on voice data with different models allows the models to be trained in parallel, which saves the time required for training and improves training efficiency. Training the domain classification model separately improves the efficiency and accuracy of matching control instructions, and thus improves the accuracy and interaction efficiency of interactive control based on voice data.
In some embodiments, referring to fig. 3F, fig. 3F is a flow chart of an interactive control method based on voice data according to an embodiment of the present application; the model may be trained in an end-to-end manner, and step 305 may be implemented through steps 3051F through 3053F, as described in more detail below.
In step 3051F, based on weighted semantic data corresponding to each control instruction, a semantic understanding model is called to perform instruction prediction processing, and a predicted instruction is obtained.
For example, the semantic understanding model matches the weighted semantic data against the semantic data corresponding to the stored control instructions (by obtaining the number of coincident characters between the pieces of semantic data, or the similarity between them), and takes the instruction corresponding to the semantic data with the highest matching degree as the predicted instruction.
In step 3052F, a first prediction penalty of the semantic understanding model is determined based on a difference between the predicted instruction and a control instruction corresponding to weighted semantic data.
By way of example, the first prediction loss takes the difference between the predicted instruction and the control instruction corresponding to the weighted semantic data as a factor, and may be any of various types of loss functions, such as mean absolute error (MAE) or cross-entropy loss.
Taking cross-entropy loss as an example: cross entropy can be used as a loss function in neural networks (machine learning). In the embodiment of the present application, the cross-entropy loss function measures the similarity between the actual control instruction and the predicted control instruction, and can be represented by the following formula (3):
L = -(1/N) * Σ_{i=1}^{N} Σ_{j=1}^{C} y_ij * log(p_ij) (3)

where C is the number of control-instruction types (each control instruction corresponds to one type), N is the number of training samples, y_ij indicates whether the i-th semantic sample belongs to the j-th type of control instruction, and p_ij is the predicted probability that the i-th training sample is of the j-th type, with value range [0, 1].
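A direct reading of formula (3) in Python, with plain nested lists standing in for model outputs (the toy numbers are illustrative; p_ij is assumed strictly positive so the logarithm is defined):

    import math

    def cross_entropy(y, p):
        # y[i][j] in {0, 1}: whether sample i belongs to instruction type j.
        # p[i][j] in (0, 1]: predicted probability that sample i is of type j.
        n = len(y)
        return -sum(y_ij * math.log(p_ij)
                    for y_i, p_i in zip(y, p)
                    for y_ij, p_ij in zip(y_i, p_i)) / n

    y = [[1, 0, 0], [0, 1, 0]]              # two samples, three instruction types
    p = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
    print(cross_entropy(y, p))              # ~0.29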
In step 3053F, the semantic understanding model is back-propagated based on the first prediction loss to obtain a trained semantic understanding model.
For example, parameters of the semantic understanding model are updated based on the first prediction loss, and a trained semantic understanding model is obtained.
According to the embodiment of the application, the model is trained end to end, so that the calculation resources required by training the model are saved, the accuracy of the model predictive control instruction is higher, and the accuracy of the interactive control is further improved.
In some embodiments, after step 305, the semantic structure data corresponding to each control instruction is encapsulated to obtain the semantic frame data corresponding to each control instruction, and the correspondence between each control instruction and its semantic frame data is stored.
For example, the frame header of the semantic structure data is determined based on the types of the entity words in the semantic structure data, and the frame header and the data body are encapsulated into frame data. For example: for weighted semantic data with weight 0.9, in which the types of the entity words are $suit and $size, the corresponding frame data is: [{domain: 'game name', type: 'template', name: 'chupai<1>', regular expression: 'I want [to play] $suit $size', meaning: 'play', word slots: ['suit', 'size'], weight: 0.9}, …].
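A sketch of the encapsulation, assuming the frame data is represented as a dictionary whose field names follow the example above (the helper and the stored mapping are illustrative):

    def to_semantic_frame(domain, name, pattern, meaning, word_slots, weight):
        # Package one piece of weighted semantic structure data as frame data:
        # the entity-word types determine the pattern and slots, which are
        # encapsulated together with the remaining fields.
        return {
            "domain": domain,
            "type": "template",
            "name": name,
            "regular_expression": pattern,
            "meaning": meaning,
            "word_slots": word_slots,
            "weight": weight,
        }

    # Stored correspondence: control instruction -> semantic frame data.
    FRAMES = {
        "play": to_semantic_frame("game name", "chupai<1>",
                                  "I want [to play] $suit $size",
                                  "play", ["suit", "size"], 0.9),
    }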
In some embodiments, referring to fig. 3G, fig. 3G is a flow chart of an interactive control method based on voice data according to an embodiment of the present application. After step 305, with the server as the execution subject, the control instruction corresponding to the voice data is identified through the following steps 306 to 307.
In step 306, in response to receiving the voice data, invoking the semantic understanding model based on the voice data performs the following processing: and converting the voice data into text data, and performing fuzzy matching processing on the text data and the semantic frame data to obtain the search confidence corresponding to each segment of semantic frame data.
For example, the voice data is sent to the server after the terminal device recognizes the user's speech. Fuzzy matching is used to compare two or more records and calculate the likelihood that they belong to the same entity: the similarity between each character string and the target character string is calculated, and the character string with the highest similarity is taken as the fuzzy matching result (the search confidence). The most common approach for calculating the similarity between character strings is the edit distance algorithm.
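A sketch of edit-distance-based fuzzy matching against the stored frame patterns; the similarity-to-confidence mapping used here is one plausible choice, not one specified by the patent:

    def edit_distance(a, b):
        # Levenshtein distance via single-row dynamic programming.
        dp = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            prev, dp[0] = dp[0], i
            for j, cb in enumerate(b, 1):
                prev, dp[j] = dp[j], min(dp[j] + 1,           # delete ca
                                         dp[j - 1] + 1,        # insert cb
                                         prev + (ca != cb))    # substitute
        return dp[-1]

    def search_confidence(text, pattern):
        # Normalized similarity in [0, 1]; 1.0 means identical strings.
        longest = max(len(text), len(pattern)) or 1
        return 1.0 - edit_distance(text, pattern) / longest

    # Step 307 then picks the frame with the highest search confidence, e.g.:
    # best = max(FRAMES.items(),
    #            key=lambda kv: search_confidence(text, kv[1]["regular_expression"]))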
In step 307, a control instruction corresponding to the semantic frame data with the highest search confidence is obtained, and the control instruction is executed.
In the embodiment of the present application, semantics are used as the retrieval data for matching; fuzzy matching saves the computing resources required for retrieving the control instruction while ensuring the accuracy of instruction matching, which improves the control efficiency of interactive control and the response speed of voice control.
The embodiment of the present application also provides an interactive control method based on voice data, with the terminal device as the execution subject. Referring to fig. 4D, fig. 4D is a schematic flow chart of the interactive control method based on voice data provided by the embodiment of the present application, described with reference to the steps shown in fig. 4D.
In step 401D, a virtual scene is displayed in the human-computer interaction interface.
By way of example, the virtual scene may be a game virtual scene or a screen of an application.
In step 402D, voice data is acquired.
For example, the sound-receiving component in the terminal device captures the sound made by the user, converts the sound from an acoustic signal into an electrical signal, and converts the electrical signal into voice data.
In step 403D, a semantic understanding model is called based on the voice data to perform semantic recognition processing, and a control instruction corresponding to the voice data is determined.
By way of example, the semantic understanding model is trained by the interactive control method based on voice data according to the embodiment of the application.
In step 404D, control instructions are executed.
For example, referring to figs. 4A to 4C, figs. 4A to 4C are schematic diagrams of the human-computer interaction interface of a terminal device according to an embodiment of the present application. Taking a card game as an example, a prompt message 401A is displayed in the human-computer interaction interface to prompt the user (player) to speak a voice control instruction, which may be playing a card. The terminal device recognizes the user's speech, converts it into the corresponding voice data, sends the voice data to the server for speech recognition, and meanwhile displays a prompt message 402A to indicate that the speech is being recognized. When the control instruction corresponding to the voice data is identified, the server returns the corresponding semantics and control instruction to the terminal device, and the terminal device displays the "automatic play" prompt message 403A and the execution result 404A; in fig. 4C, the result of executing the play instruction is that two cards (both of rank J) are played.
Referring to fig. 5, fig. 5 is a schematic flow chart of an interactive control method based on voice data according to an embodiment of the present application, and fig. 5 illustrates a process of implementing the interactive control method based on voice data according to an embodiment of the present application by cooperation of a server 200 and a terminal device 400. The server 200 includes a training server 200-1 and a speech recognition server 200-2.
The terminal device 400 performs step 501 of converting the received voice into voice data and transmitting the voice data to the server 200.
In step 502, the server 200 converts the voice data into text data and identifies the control instruction corresponding to the text data. In step 503, the server 200 sends the control instruction to the terminal device 400.
The terminal device 400 executes step 504, executes the control instruction, and displays a screen corresponding to the control instruction.
The control instruction is converted into semantic structure data, the semantic structure data is labeled with a weight value, and weighted semantic data is generated based on the semantic structure data labeled with the weight value. This improves the labeling accuracy of the training samples and the accuracy of training the semantic understanding model, and further improves the accuracy with which the semantic understanding model recognizes semantics and control instructions during interactive control.
In the following, an exemplary application of the interactive control method based on voice data according to the embodiment of the present application in an actual application scenario will be described.
In the related art, the physical spectrum features of the sound are used, a small on-device speech recognition system recognizes game-control keywords, and the keywords are finally used to determine and execute the control instruction. In this mode, the recognition accuracy of offline voice keywords is poor, their number is small, and recognition is slow, which affects the hit rate of keyword-matched instructions and makes interactive control inefficient. Taking game control by speech recognition as an example, the related art uses an ASR speech recognition system to recognize text, converts the text into vector sequences, and compares and matches them with the vector sequences corresponding to the control instructions of the game service to finally obtain the control instruction. However, the text vectors cannot be compatible with different contexts; if a new application scenario is encountered, the vectors need to be retrained, so service-support efficiency is low; and the ability of text vectors to support different semantic sentence patterns is weak, so interactive control efficiency is low.
In the interactive control method based on voice data provided by the embodiment of the present application, the control instruction text corresponding to the service requirements of an application scenario (such as a game or map software) is converted into semantic templates (PAT) and semantic keywords (Slot). This has the following advantages: 1. Wide application range. The scope of the control instructions is characterized and covered by semantic templates, and the objects referred to by the control instructions are characterized and covered by semantic keywords. 2. High update efficiency. If the scope of a control instruction changes, only the corresponding semantic template needs updating; if an object referred to by a control instruction changes, only the corresponding semantic keywords need updating. 3. Semantic data can be reused. Semantic templates and semantic entities (semantic keywords) can automatically build weighted semantic data and semantics.
In the embodiment of the present application, a semantic understanding model usable for voice recognition is produced by training on the weighted semantic data generated from personalized semantic templates and semantic entities. This has the following advantages: 1. The conversion of control instructions into personalized ASR model data is highly automated: the semantic structured characterization of the control instructions is used to automatically generate data and train personalized ASR recognition resources. 2. Control instructions and automatic speech recognition are connected through the personalized semantic data, greatly improving voice recognition precision and thus interactive control efficiency.
The embodiment of the present application converts the semantic templates and semantic entity data into weighted semantic frame data so that a semantic interpretation algorithm can recognize the control instructions. This has the following advantages: 1. High efficiency of converting control instructions into personalized semantic frame data: the semantic structured characterization of a control instruction is converted into semantic frame data efficiently. 2. Control instructions and semantic understanding are connected through the personalized (structured) semantic frame data, greatly improving semantic understanding precision.
Referring to fig. 6A, which is a flowchart of an interactive control method based on voice data according to an embodiment of the present application; taking a server as the execution body, the steps shown in fig. 6A are described below.
In step 601A, format conversion is performed on the semantic data.
Illustratively, the text of the control instruction is obtained according to the business requirement before step 601A.
Referring to fig. 6C, fig. 6C is a flowchart illustrating an interactive control method based on voice data according to an embodiment of the present application.
In step 601C, the text of the control instruction is semantically formatted to obtain a semantic template and a semantic keyword.
For example: the service is a card game, and the control instruction comprises: card-out, calling to land, robbing to land, etc. Taking entity words with variable same attributes in sentence patterns of the text of the control instruction as semantic keyword fragments, and replacing the semantic keyword fragments with attribute tags; the remaining content portion as a semantic template remains unchanged.
The control instruction text is characterized using semantic templates and semantic keyword vocabularies. Semantic templates (Pat-class) and semantic keywords (Slot-class) conforming to the control instruction text are constructed. A semantic template is a keyword-style content fragment in which the positions of variable keywords are left vacant and replaced by symbols representing the keyword category information. For example, the semantic templates Pat-class and semantic keywords Slot-class prepared from the control instruction text (service-related instruction text) of a "Fight the Landlord" (Dou Dizhu) card game voice dialogue are shown in tables (1) and (2) below.
Control instruction text | Corresponding semantic template Pat-class
"call landlord", "don't call landlord", "another round" | $action2
"landlord, play a card", "previous player, hurry up" | $role2_others $action3_others
"I want to rob the landlord", "I want to cancel auto-play" | $role1_self want to $action2
"I want to play A", "I want to play K" | I want to play $size
"I want to play six of hearts", "I want to play nine of clubs" | I want to play $suit $size
Table (1)
Type of semantic keyword Slot-class | Keyword data set of each type (example)
$action2 | "call landlord", "don't call landlord", "another round"
$role2_others | "previous player", "landlord"
$action3_others | "play a card", "hurry up", "be quick"
$role1_self | "I"
$size | "A", "six", "nine", "K"
$suit | "hearts", "spades", "diamonds", "clubs"
Table (2)
For example, in step 601A, the semantic templates and semantic keywords are formatted to obtain weighted semantic data and structured semantic frame data.
After step 601A, step 602A is performed to train a speech recognition model, and step 604A is performed to train a domain classification model; step 602A is explained below. Steps 601A to 602A may be implemented through steps 601C to 603C. In step 602C, the weight values corresponding to the semantic templates and the semantic keywords are calculated. In step 603C, the semantic templates and semantic keywords are labeled with the weight values.
In some embodiments, the semantic templates and semantic keywords may be scored by the following equation (1), the score being used as the weight value of the corresponding semantic data:

score(x) = uni_count(x) / SUM_CNT, when gram_count(x) = 1;  score(x) = N-gram_prob(x), when gram_count(x) > 1   (1)

where uni is a single vocabulary item in the dictionary and uni_count(x) is its statistical word frequency; SUM_CNT is the sum of the word frequencies of all vocabulary items; N-gram_prob is the prediction probability score of the N-gram language model; and gram_count counts the number of vocabulary items included in the control instruction.
For example: 1. the control instruction text is "play" and includes a word. The frequency of the card is 60, and the sum of the frequencies of all words is 10000, and the frequency of occurrence of the text of the control instruction is taken as a score, and score ('card out')=60/10000=0.006. 2. The control instruction text is "red peach six" composed of two words, and gram count is equal to 2. The probability was calculated using an N-gram model, N-gram_prob ('red peach', 'six') =0.018.
Here, to solve the problem of an excessive number of free parameters, a Markov assumption is introduced: the probability of any word occurring depends only on the limited n-1 words preceding it. A statistical language model based on this assumption is called an N-gram language model; that is, the probability of the current (nth) word is estimated using the preceding N-1 words as history.
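As a rough illustration of equation (1), the sketch below scores a single-word instruction by its unigram frequency and a multi-word instruction by a maximum-likelihood bigram chain under the Markov assumption; all counts here are invented for the example.

```python
from collections import Counter

# Illustrative corpus statistics; real word frequencies would come from the
# training text data. The counts below are made up for the example.
UNI = Counter({"play-card": 60, "hearts": 30, "six": 25})
BI = Counter({("hearts", "six"): 9})
SUM_CNT = 10000

def score(words):
    """Weight per equation (1): unigram frequency for a single word,
    an N-gram chain probability estimate for longer instructions."""
    if len(words) == 1:
        return UNI[words[0]] / SUM_CNT
    p = UNI[words[0]] / SUM_CNT
    for prev, cur in zip(words, words[1:]):
        # Markov assumption: P(cur | prev) estimated by maximum likelihood.
        p *= BI[(prev, cur)] / max(UNI[prev], 1)
    return p

print(score(["play-card"]))      # 0.006, matching the worked example above
print(score(["hearts", "six"]))  # a small chain probability
```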
Labeling the semantic keywords or semantic templates in table (1) or table (2) with weight values can be characterized as follows:
PAT "I want to play $size": 0.9;
PAT "$role2_others $action3_others": 0.6;
SLOT "call landlord": 0.7, "don't call landlord": 0.8, "another round": 0.2.
Illustratively, a large amount of weighted semantic data is generated from the semantic templates and semantic keywords, and can then be used to train the semantic understanding model. A semantic template and the sets of related words it contains are obtained. For example, a semantic template may include a keyword of type $HERO, where $HERO stands for entity words and the corresponding keyword set may be hero names (role A, role B).
For example, according to the semantic feature form of the semantic template, the keyword tags in the semantic feature form are pointed to the corresponding keyword sets to obtain a large amount of weighted text data. The computation can be characterized as the following formulas (2.1), (2.2) and (2.3).
decare(A, B) = {(x_i, y_i, w_ai · w_bi) | x_i ∈ A, y_i ∈ B, w_ai ∈ A, w_bi ∈ B}   (2.1)
beam({c_1, c_2, …, c_n}) = {c_i | 1 ≤ i ≤ n, c_i ≥ beam_threshold}   (2.2)
semantic_gen({seq}) = {beam(decare(seq_i, seq_(i+1))) | seq_i ∈ seq, 1 ≤ i ≤ n-1}   (2.3)
Here, decare(A, B) characterizes the weighted expansion of adjacent vocabulary sequences A and B using a Cartesian product. For example, if set A = {a, b} and set B = {0, 1, 2}, the Cartesian product of the two sets is {(a,0), (a,1), (a,2), (b,0), (b,1), (b,2)}.
In formulas (2.2) and (2.3), semantic_gen({seq}) characterizes dynamic ordering and clipping of the number of expansions using a Beam algorithm to generate the weighted semantic data. The Beam algorithm is a pruning algorithm used to obtain the probabilities corresponding to vocabulary sequences and retain the several sequences with the highest probability. For example, with 3*3 = 9 candidates in total and a beam width of 6, the 6 of the 9 candidates with the highest probability are retained.
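A compact sketch of the expansion pipeline of formulas (2.1) to (2.3) follows, with the simplifying assumption that pruning keeps a fixed number of candidates (a beam size) rather than applying the threshold form of equation (2.2); the slot values and weights are illustrative.

```python
import itertools

def decare(seqs_a, seqs_b):
    """Formula (2.1): weighted Cartesian expansion; the combined weight is
    the product of the two item weights."""
    return [(sa + sb, wa * wb)
            for (sa, wa), (sb, wb) in itertools.product(seqs_a, seqs_b)]

def beam(candidates, beam_size):
    """Formula (2.2)-style clipping, here by size: keep only the
    highest-weighted expansions."""
    return sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]

def semantic_gen(slot_sets, beam_size=6):
    """Formula (2.3): expand the slot sets left to right, pruning as we go."""
    result = slot_sets[0]
    for nxt in slot_sets[1:]:
        result = beam(decare(result, nxt), beam_size)
    return result

suits = [(("hearts",), 0.25), (("spades",), 0.25), (("clubs",), 0.25)]
sizes = [(("six",), 0.4), (("nine",), 0.35), (("K",), 0.25)]
# 3 * 3 = 9 candidates; the beam keeps the 6 with the highest weight.
for seq, w in semantic_gen([suits, sizes]):
    print(seq, round(w, 3))
```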
Illustratively, the weighted semantic data generated from the semantic templates and semantic keywords can be characterized as table (3) below.
Table (3)
Referring to fig. 6E, fig. 6E is a flowchart illustrating an interactive control method based on voice data according to an embodiment of the present application.
In step 601E, normalization processing is performed on the weighted semantic data to obtain normalized weighted semantic data. In step 602E, an N-gram language model is trained based on the normalized weighted semantic data. In step 603E, an automatic speech recognition model is built based on the trained N-gram language model.
Illustratively, the normalized weighted semantic data is used to train an N-gram model (domain classification model), and state machine recognition resources carrying the weighted service classification information are produced based on the N-gram model. The N-gram model counts vocabulary sequences of length N and, from the counted data and the maximum-likelihood principle, computes the probability that the N-th vocabulary item follows the preceding N-1 items to form the sequence.
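The sketch below shows one plausible way weighted samples could feed this maximum-likelihood estimation, counting each adjacent token pair at its sample weight; the data, function name, and absence of smoothing are all simplifying assumptions.

```python
from collections import defaultdict

def train_ngram(weighted_samples):
    """Estimate bigram probabilities from weighted semantic data.

    Each (tokens, weight) sample contributes its weight to the counts of
    every adjacent token pair; probabilities are then the maximum-likelihood
    ratio count(prev, cur) / count(prev). A real system would add smoothing.
    """
    bigram = defaultdict(float)
    unigram = defaultdict(float)
    for tokens, weight in weighted_samples:
        for prev, cur in zip(tokens, tokens[1:]):
            bigram[(prev, cur)] += weight
            unigram[prev] += weight
    return {pair: cnt / unigram[pair[0]] for pair, cnt in bigram.items()}

model = train_ngram([
    (["I", "want", "to", "play", "hearts", "six"], 0.9),
    (["I", "want", "to", "play", "clubs", "nine"], 0.7),
])
print(model[("to", "play")])  # 1.0 -- "play" always follows "to" here
```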
By way of example, referring to FIG. 6D, which is a schematic diagram of training a speech recognition model in an embodiment of the present application: the state machine has state nodes 0 to 5. Arrows between the state nodes represent the data transition directions between them, and each arrow is labeled with a semantic keyword and its corresponding weight value. For example, the transition path of the vocabulary sequence [play, clubs, six] runs through state node 0, state node 1, state node 2, and state node 5 of the state machine. The data path between the state machine nodes characterizes how the probability of a vocabulary sequence is obtained: the probability of the sequence [play, clubs, six] is the product of the weights of the arcs traversed from state node 0 onward. The weight 0.6 on the arc into state node 5 results from packaging the semantic understanding model into an FST memory structure and merging the transition probabilities.
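To make this path-product scoring concrete, the toy state machine below walks a vocabulary sequence through a weighted arc table and multiplies the arc weights; the states, keywords, and weights are illustrative, not the figure's actual resources.

```python
# A toy weighted state machine in the spirit of fig. 6D: each arc carries a
# semantic keyword and a weight, and a vocabulary sequence's score is the
# product of the weights along its path.
ARCS = {
    (0, "play"):  (1, 0.9),
    (1, "clubs"): (2, 0.25),
    (2, "six"):   (5, 0.6),
}

def path_score(tokens, state=0):
    """Multiply arc weights along the path; 0.0 if the path breaks."""
    score = 1.0
    for tok in tokens:
        if (state, tok) not in ARCS:
            return 0.0  # no arc: the sequence is not accepted
        state, w = ARCS[(state, tok)]
        score *= w
    return score

print(path_score(["play", "clubs", "six"]))  # 0.9 * 0.25 * 0.6 = 0.135
```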
In the embodiment of the present application, state machine resources carrying the service classification information are produced based on the N-gram model. Because these resources have a defined data path, the efficiency of interactive control based on voice data can be improved and computing resources saved, further improving interaction efficiency.
With continued reference to FIG. 6A, in step 603A, an ASR speech recognition engine is constructed.
Illustratively, an automatic speech recognition engine is built based on the trained N-gram model. Referring to FIG. 7A, which is a schematic diagram of the data structure of an ASR speech recognition engine 701A according to an embodiment of the present application: the data structure of the ASR speech recognition engine 701A includes service configuration information, a generic model, and a plurality of trained speech recognition models (speech recognition model 1, …, speech recognition model N-1, speech recognition model N); fst is the storage format of the model files.
With continued reference to FIG. 6A, in step 605A, a semantic understanding model is constructed.
For example, the weighted semantic structured data is converted into weighted semantic frame data, and semantic understanding policy model resources are trained and produced so that the semantic understanding model can determine the individual control instructions in the game from the voice data. Referring to fig. 6F, fig. 6F is a schematic flow chart of an interactive control method based on voice data according to an embodiment of the present application. Step 605A may be implemented by step 601F and step 602F.
In step 601F, the weighted semantic data is converted into weighted semantic frame data. In step 602F, the weighted semantic frame data is normalized, and a semantic understanding model is generated based on the normalized weighted semantic frame data.
For example, semantic characterization data of game control instruction text is converted into weighted semantic frame data according to a fixed format.
The semantic understanding model loads the weighted semantic frame data according to the control instruction text and performs fuzzy matching between the input text data and the weighted semantic frame data, producing a search hit confidence of the text data for each piece of weighted semantic frame data. The weighted semantic frame data with the highest hit confidence is selected as the recognition result, and the terminal device executes the control instruction corresponding to that weighted semantic frame data.
For example: the weighted semantic data is: i want to send out $ color $ size 0.9
The conversion into corresponding frame data is as follows: { [ { domain } ' game name ', type } ' template } ' name } ' chupai1', regular expression } ' I want [ (out of the way ] $suit $ size ', intention } ' play } ' word slot [ (in the way of flower } ' size ], weight: 0.9}, …, ].
The semantic keywords are: design and color (red peach: 0.25, black peach: 0.25, plum blossom: 0.25 …)
The conversion into corresponding frame data is as follows: { domain } 'game name', type } 'word slot', key } 'flower color', value [ 'red peach: -0.25', 'black peach: -0.25', 'plum: -0.25' … ] }
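As a minimal sketch of the fuzzy-matching step, assuming regex-style templates and taking the frame weight directly as the hit confidence (a real system would also score match quality and slot values), recognition might look like this; all frame contents and field names are illustrative.

```python
import re

# Text is matched against each weighted semantic frame, a hit confidence is
# produced, and the best frame's intent becomes the control instruction.
FRAMES = [
    {"name": "chupai1", "regex": r"I want to play (?P<suit>\w+) (?P<size>\w+)",
     "intent": "play_card", "weight": 0.9},
    {"name": "rob1", "regex": r"I want to rob (the )?landlord",
     "intent": "rob_landlord", "weight": 0.7},
]

def understand(text):
    """Return (intent, confidence, slots) of the best-matching frame."""
    best = None
    for frame in FRAMES:
        m = re.search(frame["regex"], text)
        if m:
            confidence = frame["weight"]  # simplified hit confidence
            if best is None or confidence > best[1]:
                best = (frame["intent"], confidence, m.groupdict())
    return best

print(understand("I want to play hearts six"))
# -> ('play_card', 0.9, {'suit': 'hearts', 'size': 'six'})
```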
With continued reference to fig. 6A, in step 606A, semantic recognition and intent extraction are performed.
Referring to FIG. 7B, which is a schematic diagram of the data structure of a semantic understanding model according to an embodiment of the present application. The semantic understanding model 701B may be used for semantic recognition and intent understanding. The frame data may be stored in the bin format, and the semantic understanding model 701B further includes a plurality of semantic templates (semantic template 1, semantic template 2, …, semantic template N) and a plurality of semantic keywords (semantic keyword 1, semantic keyword 2, …, semantic keyword N).
Referring to fig. 6B, which is a schematic diagram of an application scenario provided by an embodiment of the present application: the voice interaction control service 601B runs in the server 200-2 of fig. 1, and the client 602B runs in the terminal device 400. The human-computer interaction flow is as described above with reference to fig. 4A to fig. 4C.
The embodiment of the present application outperforms competing solutions in the flexibility and efficiency of voice recognition service support, as well as in the supporting effect and the achievement rate of the voice recognition service. Structuring and concretizing the abstract service's control instruction text supports rapid customization and data sharing, improving service support efficiency. The ASR speech recognition engine uses the semantic understanding model to improve the accuracy of the recognized text at the source. Semantic enhancement runs through the full system link, and each link jointly improves the voice recognition service supporting effect, ultimately improving the interactive control efficiency of the voice recognition service. Referring to fig. 7C, an effect comparison table of the embodiment of the present application: the semantic understanding model of the interactive control method based on voice data provided by the embodiment of the present application exceeds the full-domain model of the related art in confusion degree, recognition accuracy, interactive control efficiency (command achievement rate), and so on.
Continuing with the description of an exemplary architecture of the interactive control device 455 based on voice data implemented as software modules according to an embodiment of the present application: in some embodiments, as shown in fig. 2A, the software modules stored in the memory 450 in the interactive control device 455 based on voice data may include: a sample acquisition module 4551 configured to acquire text data used as training samples, wherein the text data comprises a plurality of control instruction texts; a sample processing module 4552 configured to perform semantic structuring processing on the text data to obtain semantic structure data of each control instruction; the sample processing module 4552 further configured to acquire a weight value corresponding to each control instruction; the sample processing module 4552 further configured to label the semantic structure data of each control instruction based on the weight value of each control instruction to obtain weighted semantic data; and a model training module 4553 configured to train a semantic understanding model based on the weighted semantic data, wherein the trained semantic understanding model is used for converting voice data into text data and identifying the control instruction corresponding to the text data.
In some embodiments, sample processing module 4552 is configured to perform the following for each control instruction text: and acquiring the vocabulary attribute of each vocabulary in the control instruction text, wherein the vocabulary attribute comprises: entity words and non-entity words; and replacing the entity words based on vocabulary attribute labels corresponding to each entity word in the control instruction text, and reserving each non-entity word in the control instruction text to obtain semantic structure data of each control instruction.
In some embodiments, sample processing module 4552 is configured to perform the following for each control instruction text: acquiring the vocabulary quantity in a control instruction text; when the number of words is 1, the occurrence probability of words corresponding to the control instruction text in the text data is obtained, and the occurrence probability is used as a weight value corresponding to the control instruction; when the number of words is greater than 1, combining each word in the control instruction text into a word sequence, carrying out probability prediction processing on the word sequence to obtain word sequence probability, and taking the word sequence probability as a weight value of the control instruction.
In some embodiments, the sample processing module 4552 is configured to obtain a plurality of entity words associated with each vocabulary attribute tag and the occurrence frequency corresponding to each entity word, and to take the occurrence frequency corresponding to each entity word as the weight value of that entity word; and to perform the following processing on the semantic structure data of each control instruction: taking the semantic structure data of the control instruction as a semantic template; combining the plurality of entity words associated with the vocabulary attribute tags included in the semantic template to obtain multiple pieces of new semantic structure data; determining a weight value of each piece of new semantic structure data based on the weight values of the entity words included in it; and labeling each piece of new semantic structure data to obtain weighted semantic data.
In some embodiments, the sample processing module 4552 is configured to perform multiple combination processes based on the semantic template to obtain a plurality of different vocabulary sequences, wherein each combination process comprises: extracting one target entity word from the plurality of entity words associated with each vocabulary attribute tag; combining the target entity words into a vocabulary sequence in the order in which their vocabulary attribute tags appear in the semantic template; and combining the non-entity words in the semantic template with each vocabulary sequence respectively to obtain multiple pieces of new semantic structure data.
In some embodiments, the sample processing module 4552 is configured to perform the following processing for each vocabulary sequence before the non-entity words in the semantic template are combined with each vocabulary sequence to obtain multiple pieces of new semantic structure data: multiplying the weight values corresponding to the entity words included in the vocabulary sequence in turn to obtain the word sequence probability corresponding to the vocabulary sequence; sorting the vocabulary sequences in descending order of word sequence probability; and retaining at least one vocabulary sequence at the head of the descending-order result, the at least one vocabulary sequence being used to generate new semantic structure data.
In some embodiments, the semantic understanding model includes a speech recognition model and a domain classification model; the model training module 4553 is configured to normalize weighted semantic data to obtain normalized weighted semantic data; training a speech recognition model based on the normalized weighted semantic data, and training a domain classification model based on the normalized weighted semantic data; and combining the trained voice recognition model with the trained domain classification model to obtain a trained semantic understanding model.
In some embodiments, the model training module 4553 is configured to invoke the speech recognition model to perform a training task of predicting semantics corresponding to the control instruction text based on the normalized weighted semantic data; based on the normalized weighted semantic data, a domain classification model is called to execute a training task of a control instruction corresponding to the predicted semantic data.
In some embodiments, the model training module 4553 is configured to invoke the semantic understanding model to perform instruction prediction processing based on weighted semantic data corresponding to each control instruction, so as to obtain a predicted instruction; determining a first prediction loss of the semantic understanding model based on the difference between the prediction instruction and the control instruction corresponding to the weighted semantic data; and carrying out back propagation processing on the semantic understanding model based on the first prediction loss to obtain a trained semantic understanding model.
In some embodiments, the model training module 4553 is configured to perform packaging processing on the semantic structure data corresponding to each control instruction after training the semantic understanding model based on weighted semantic data, so as to obtain semantic frame data corresponding to each control instruction; storing the correspondence between: each control instruction and semantic frame data corresponding to each control instruction.
In some embodiments, the model training module 4553 is configured to, after storing the correspondence between the following data, invoke the semantic understanding model based on the voice data to perform the following processing in response to receiving the voice data: converting the voice data into text data, and carrying out fuzzy matching processing on the text data and the semantic frame data to obtain a search confidence corresponding to each segment of semantic frame data; and acquiring a control instruction corresponding to the semantic frame data with the highest searching confidence, and executing the control instruction.
The embodiment of the application also provides an interaction control device based on voice data, which comprises: the display module is configured to display the virtual scene in the human-computer interaction interface; the voice acquisition module is configured to acquire voice data; the recognition module is configured to call a semantic understanding model based on the voice data to carry out semantic recognition processing and determine a control instruction corresponding to the voice data, wherein the semantic understanding model is obtained through training by the interactive control method based on the voice data; and the display module is configured to execute the control instruction.
Embodiments of the present application provide a computer program product comprising a computer program or computer-executable instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer executable instructions from the computer readable storage medium, and executes the computer executable instructions, so that the computer device executes the interactive control method based on voice data according to the embodiment of the application.
Embodiments of the present application provide a computer-readable storage medium having stored therein computer-executable instructions which, when executed by a processor, cause the processor to perform the voice data based interactive control method provided by the embodiments of the present application, for example, the voice data based interactive control method as shown in fig. 3A.
In some embodiments, the computer readable storage medium may be FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; but may be a variety of devices including one or any combination of the above memories.
In some embodiments, computer-executable instructions may be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, in the form of programs, software modules, scripts, or code, and they may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.
As an example, computer-executable instructions may, but need not, correspond to files in a file system, may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a hypertext markup language (HTML, hyper Text Markup Language) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
As an example, computer-executable instructions may be deployed to be executed on one electronic device or on multiple electronic devices located at one site or distributed across multiple sites and interconnected by a communication network.
In summary, in the embodiment of the present application, the control instruction is converted into semantic structure data, the semantic structure data is labeled with weight values, and weighted semantic data is generated from the labeled semantic structure data. This improves the labeling accuracy of the training samples, improves the accuracy of the trained semantic understanding model, and further improves the accuracy with which the semantic understanding model identifies semantics and control instructions during interactive control.
The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (17)

1. An interactive control method based on voice data, the method comprising:
acquiring text data used as training samples, wherein the text data comprises a plurality of control instruction texts;
carrying out semantic structuring processing on the text data to obtain semantic structure data of each control instruction;
acquiring a weight value corresponding to each control instruction;
labeling the semantic structure data of each control instruction based on the weight value of each control instruction to obtain weighted semantic data;
And training the semantic understanding model based on the weighted semantic data, wherein the trained semantic understanding model is used for converting the voice data into text data and identifying a control instruction corresponding to the text data.
2. The method according to claim 1, wherein the performing semantic structuring processing on the text data to obtain semantic structure data of each control instruction comprises:
the following processing is executed for each control instruction text:
and acquiring the vocabulary attribute of each vocabulary in the control instruction text, wherein the vocabulary attribute comprises: entity words and non-entity words;
and replacing the entity words based on vocabulary attribute labels corresponding to each entity word in the control instruction text, and reserving each non-entity word in the control instruction text to obtain semantic structure data of each control instruction.
3. The method of claim 1, wherein the obtaining the weight value corresponding to each control instruction includes:
the following processing is executed for each control instruction text:
acquiring the vocabulary quantity in the control instruction text;
When the number of words is 1, the occurrence probability of words corresponding to the control instruction text in the text data is obtained, and the occurrence probability is used as a weight value corresponding to the control instruction;
When the vocabulary quantity is larger than 1, combining each vocabulary in the control instruction text into a vocabulary sequence, carrying out probability prediction processing on the vocabulary sequence to obtain word sequence probability, and taking the word sequence probability as a weight value of the control instruction.
4. The method according to claim 2, wherein the labeling the semantic structure data of each control instruction based on the weight value of each control instruction to obtain weighted semantic data includes:
Acquiring a plurality of entity words associated with each vocabulary attribute tag and the occurrence frequency corresponding to each entity word, and taking the occurrence frequency corresponding to each entity word as a weight value of the entity word;
the following processing is performed on the semantic structure data of each control instruction:
Taking the semantic structure data of the control instruction as a semantic template;
Combining a plurality of entity words associated with vocabulary attribute tags included in the semantic template to obtain multiple pieces of new semantic structure data;
Determining a weight value of each piece of new semantic structure data based on the weight value of the entity word included in each piece of new semantic structure data;
and labeling each section of the new semantic structure data to obtain weighted semantic data.
5. The method of claim 4, wherein the combining the plurality of entity words associated with the vocabulary attribute tags included in the semantic template to obtain multiple pieces of new semantic structure data comprises:
And performing a plurality of combination processes based on the semantic templates to obtain a plurality of different vocabulary sequences, wherein the combination processes comprise: extracting a target entity word from a plurality of entity words associated with each vocabulary attribute tag; according to the sequence of each vocabulary attribute label in the semantic template, combining each target entity word into a vocabulary sequence in turn;
and combining the non-entity words in the semantic template with each vocabulary sequence to obtain multiple pieces of new semantic structure data.
6. The method of claim 5, wherein before combining non-entity words in the semantic template with each of the vocabulary sequences to obtain multiple pieces of new semantic structure data, the method further comprises:
The following is performed for each of the vocabulary sequences: sequentially multiplying the weight values corresponding to each entity word included in the vocabulary sequence to obtain the word sequence probability corresponding to the vocabulary sequence;
And carrying out descending order sorting processing on each vocabulary sequence based on the word sequence probability, and reserving at least one vocabulary sequence of the head of the descending order sorting processing result, wherein the at least one vocabulary sequence is used for generating new semantic structure data.
7. The method of claim 1, wherein the semantic understanding model comprises a speech recognition model and a domain classification model;
The training the semantic understanding model based on the weighted semantic data comprises:
normalizing the weighted semantic data to obtain normalized weighted semantic data;
Training the speech recognition model based on the normalized weighted semantic data, and training the domain classification model based on the normalized weighted semantic data;
And combining the trained voice recognition model with the trained domain classification model to obtain a trained semantic understanding model.
8. The method of claim 7, wherein the training the speech recognition model based on the normalized weighted semantic data and training the domain classification model based on the normalized weighted semantic data comprises:
invoking the voice recognition model to execute a semantic training task corresponding to a predictive control instruction text based on the normalized weighted semantic data;
and calling the domain classification model to execute a training task of a control instruction corresponding to the predicted semantic data based on the normalized weighted semantic data.
9. The method of claim 1, wherein the training the semantic understanding model based on the weighted semantic data comprises:
Based on weighted semantic data corresponding to each control instruction, invoking the semantic understanding model to conduct instruction prediction processing to obtain a predicted instruction;
Determining a first prediction loss of the semantic understanding model based on a difference between the prediction instruction and a control instruction corresponding to weighted semantic data;
And carrying out back propagation processing on the semantic understanding model based on the first prediction loss to obtain the trained semantic understanding model.
10. The method of claim 1, wherein after the training the semantic understanding model based on the weighted semantic data, the method further comprises:
Packaging the semantic structure data corresponding to each control instruction to obtain semantic frame data corresponding to each control instruction;
storing the correspondence between: each control instruction and semantic frame data corresponding to each control instruction.
11. The method of claim 10, wherein after storing the correspondence between the following data, the method further comprises:
In response to receiving voice data, invoking the semantic understanding model based on the voice data to perform the following:
Converting the voice data into text data, and performing fuzzy matching processing on the text data and the semantic frame data to obtain search confidence corresponding to each segment of the semantic frame data;
And acquiring a control instruction corresponding to the semantic frame data with highest searching confidence and executing the control instruction.
12. An interactive control method based on voice data, the method comprising:
displaying a virtual scene in a human-computer interaction interface;
Acquiring voice data;
Invoking a semantic understanding model based on the voice data to perform semantic recognition processing, and determining a control instruction corresponding to the voice data, wherein the semantic understanding model is obtained by training the interactive control method based on the voice data according to any one of claims 1 to 11;
And executing the control instruction.
13. An interactive control device based on voice data, the device comprising:
a sample acquisition module configured to acquire text data for use as training samples, wherein the text data includes a plurality of control instruction texts;
The sample processing module is configured to perform semantic structuring processing on the text data to obtain semantic structure data of each control instruction;
the sample processing module is further configured to acquire a weight value corresponding to each control instruction;
the sample processing module is further configured to label the semantic structure data of each control instruction based on the weight value of each control instruction to obtain weighted semantic data;
The model training module is configured to train the semantic understanding model based on the weighted semantic data, wherein the trained semantic understanding model is used for converting the voice data into text data and identifying control instructions corresponding to the text data.
14. An interactive control device based on voice data, the device comprising:
The display module is configured to display the virtual scene in the human-computer interaction interface;
The voice acquisition module is configured to acquire voice data;
The recognition module is configured to call a semantic understanding model based on the voice data to carry out semantic recognition processing and determine a control instruction corresponding to the voice data, wherein the semantic understanding model is trained by the interactive control method based on the voice data according to any one of claims 1 to 11;
the display module is further configured to execute the control instruction.
15. An electronic device, the electronic device comprising:
A memory for storing computer executable instructions;
a processor for implementing the voice data based interactive control method of any one of claims 1 to 12 when executing computer executable instructions stored in said memory.
16. A computer readable storage medium storing computer executable instructions which, when executed by a processor, implement the speech data based interactive control method of any one of claims 1 to 12.
17. A computer program product comprising a computer program or computer executable instructions which, when executed by a processor, implement the speech data based interactive control method of any one of claims 1 to 12.
CN202211352352.4A 2022-10-31 2022-10-31 Interactive control method, device, equipment and storage medium based on voice data Pending CN117953895A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211352352.4A CN117953895A (en) 2022-10-31 2022-10-31 Interactive control method, device, equipment and storage medium based on voice data


Publications (1)

Publication Number Publication Date
CN117953895A true CN117953895A (en) 2024-04-30

Family

ID=90797039




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination