CN117995174A - Learning type electric toy control method based on man-machine interaction

Info

Publication number: CN117995174A
Application number: CN202410406714.6A
Authority: CN (China)
Prior art keywords: voice, array, characterization, behavior, embedded
Legal status: Pending
Inventor: 李恺
Applicant and current assignee: Guangdong Shifeng Intelligent Technology Co., Ltd.
Other languages: Chinese (zh)
Classification (Landscapes): Toys (AREA)

Abstract

The invention provides a learning type electric toy control method based on human-computer interaction. A core voice characterization array, i.e. the part of the voice characterization array that carries semantic content also reflected in the behavior characterization array, is determined from the behavior characterization array. Feature interaction is then performed on the behavior characterization array and the core voice characterization array to obtain an integrated characterization array, which is embedded into a preset feature domain to obtain an integrated embedded characterization array. In this way, information such as emotion and mental state expressed in the voice data is aligned with the behavior characterization array carrying the same information, completing the feature integration. When target voice parsing is performed according to the integrated embedded characterization array and the behavior interaction data to be analyzed is voice-marked according to the resulting target voice mark set, recognition reliability is improved, which in turn enables better interactive control and brings a better interaction experience to the user.

Description

Learning type electric toy control method based on man-machine interaction
Technical Field
The application relates to the field of data processing and machine learning, in particular to a learning type electric toy control method based on man-machine interaction.
Background
With the progress of science and technology and the improvement of people's living standards, electric toys have become an important part of children's entertainment and education. Learning type electric toys in particular not only provide entertainment but also integrate educational elements to help children learn new knowledge and develop intelligence while playing. However, existing learning type electric toys still leave much to be desired in terms of intelligence and personalization.
Conventional learning type electric toys usually come with fixed, preset functions and modes and cannot be adjusted dynamically according to changes in a child's state and interests. This limits their entertainment and educational effect and fails to meet children's diverse needs. Furthermore, because they lack an effective learning mechanism, these toys cannot learn and improve from their interactions with children, which limits their long-term value.
Accordingly, those skilled in the art are continually exploring how to apply more advanced machine learning and artificial intelligence techniques to learning type electric toys. By introducing these techniques, a learning type electric toy can gain stronger adaptive ability, adjust intelligently according to a child's actions and feedback, and provide a more personalized entertainment and education experience. At the same time, the toy can keep optimizing its functions and performance through continuous learning, extending its useful life and improving its cost effectiveness.
In the prior art, user feedback is usually collected and analyzed from a single source, such as voice alone or images alone, so the user state obtained from the analysis may not be accurate enough. How to collect and exploit multi-source information is therefore a technical problem to be overcome.
Disclosure of Invention
The invention aims to provide a learning type electric toy control method based on man-machine interaction.
The embodiment of the application is realized as follows: in a first aspect, an embodiment of the present application provides a method for controlling a learning electric toy based on man-machine interaction, the method including: acquiring a behavior representation array of the behavior interaction data to be analyzed and a voice representation array of the voice interaction data to be analyzed; the behavior interaction data to be analyzed is matching behavior data corresponding to the voice interaction data to be analyzed; determining a core voice characterization array in the voice characterization arrays according to the behavior characterization arrays; the core voice characterization array is a voice characterization array with semantic characterization involved in the behavior characterization array; performing feature interaction according to the behavior characterization array and the core voice characterization array to obtain an integrated characterization array, and embedding the integrated characterization array into a preset feature domain to obtain an integrated embedded characterization array; performing target voice analysis according to the integrated embedded representation array to obtain a target voice mark set; and formulating a target interaction control strategy according to the target voice mark set so as to control the learning type electric toy according to the target interaction control strategy.
In one embodiment, the determining a core speech characterization array of the speech characterization arrays from the behavior characterization arrays includes: determining a cross-attention influence coefficient according to the behavior characterization array and the voice characterization array; correcting the voice characterization array according to the cross-attention influence coefficient to obtain a core voice characterization array; the feature interaction is performed according to the behavior characterization array and the core voice characterization array to obtain an integrated characterization array, which comprises the following steps: determining an internal attention influence coefficient according to the behavior characterization array; correcting the behavior characterization array according to the internal attention influence coefficient to obtain a core behavior characterization array; and performing characteristic interaction on the behavior characterization array, the core voice characterization array and the core behavior characterization array to obtain an integrated characterization array.
In one embodiment, the analyzing the target voice according to the integrated embedded token array to obtain a target voice tag set includes: determining a first selected preset voice object from a plurality of preset voice objects according to the integrated embedded representation array; the first selected preset voice object is a preset voice object in the behavior interaction data to be analyzed; determining a first target voice object prototype feature parameter corresponding to the first selected preset voice object in a voice object prototype feature parameter set, wherein the voice object prototype feature parameter set comprises a voice object prototype feature parameter of each preset voice object in the plurality of preset voice objects; performing similarity measurement on the integrated embedded characterization array according to the prototype characteristic parameters of the first target voice object to obtain a target integrated embedded characterization array corresponding to the first selected preset voice object; and carrying out target voice analysis according to the target integration embedded representation array to obtain a target voice mark set corresponding to the first selected preset voice object.
In one embodiment, the determining a first selected one of the plurality of preset speech objects according to the integrated embedded token array includes: extracting an embedded characterization array at a position corresponding to a preset classification identification code from the integrated embedded characterization array to obtain a classification embedded characterization array, wherein the preset classification identification code is an identification code arranged in the head of the behavior interaction data to be analyzed when the behavior characterization array is generated; mapping classification is carried out according to the classification embedded characterization array to obtain classification information, wherein the classification information comprises a support coefficient corresponding to each preset voice object in a plurality of preset voice objects, and the support coefficient characterizes the confidence level of the corresponding preset voice object in the interaction data of the behavior to be analyzed; determining a first selected preset voice object, of which the support coefficient is greater than a support coefficient threshold, from the plurality of preset voice objects; the step of performing similarity measurement on the integrated embedded token array according to the prototype feature parameters of the first target voice object to obtain a target integrated embedded token array corresponding to the first selected preset voice object includes: determining similarity measurement results between each embedded characterization array in the integrated embedded characterization array and the prototype characteristic parameters of the first target voice object; multiplying the similarity measurement result with the embedded characterization array, and splicing the multiplication result with the embedded characterization array to obtain a target integrated embedded characterization array.
In one embodiment, the method is performed by means of a voice markup network, and the method further includes a step of debugging the voice markup network, including: obtaining a debugging learning sample from a debugging learning sample library, wherein the debugging learning sample includes behavior interaction sample data and corresponding voice interaction sample data, and the behavior interaction sample data corresponds to a comparison voice mark set; loading the debugging learning sample into an embedding mapping component of an initial voice markup network to obtain an output sample integration embedded characterization array; the embedding mapping component is used for determining a sample behavior characterization array corresponding to the behavior interaction sample data according to a pre-debugged behavior coding module, determining a sample voice characterization array corresponding to the voice interaction sample data according to a pre-debugged voice processing network, determining a sample core voice characterization array in the sample voice characterization array according to the sample behavior characterization array, performing feature interaction according to the sample behavior characterization array and the sample core voice characterization array to obtain a sample integration characterization array, and performing an embedding operation on the sample integration characterization array according to a pre-debugged integration neural network to obtain the sample integration embedded characterization array; loading the sample integration embedded characterization array into a restoration mapping component of the initial voice markup network to perform target voice parsing to obtain inference confidence levels of a plurality of inference voice mark sets, wherein the plurality of inference voice mark sets include the comparison voice mark set; determining an error value according to the inference confidence levels of the plurality of inference voice mark sets; and respectively optimizing the learnable parameter values in the embedding mapping component and the restoration mapping component according to the error value until the debugging cut-off requirement is met, thereby obtaining the voice markup network.
In one embodiment, the optimizing the learnable parameter values in the embedding mapping component and the restoration mapping component according to the error value includes: determining a first learning rate of a first learnable parameter according to the error value, wherein the first learnable parameter is a learnable parameter in the pre-debugged voice processing network; determining a second learning rate of a second learnable parameter according to the error value, wherein the second learning rate is greater than the first learning rate; the second learnable parameter includes a learnable parameter in the embedding mapping component other than the first learnable parameter and a learnable parameter in the restoration mapping component; and respectively optimizing the corresponding learnable parameter values according to the first learning rate and the second learning rate.
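The two learning rates described above correspond to what deep-learning frameworks call parameter groups; a hedged PyTorch-style sketch (the module shapes and concrete rates are arbitrary placeholders, not values from this application):

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the pre-debugged voice processing network
# (first learnable parameters) and the remaining learnable parts of the embedding
# and restoration mapping components (second learnable parameters).
voice_processing_net = nn.Linear(32, 16)
other_layers = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 8))

# The second learning rate is larger than the first, as described above.
optimizer = torch.optim.Adam([
    {"params": voice_processing_net.parameters(), "lr": 1e-5},  # first learning rate
    {"params": other_layers.parameters(), "lr": 1e-3},          # second learning rate
])
```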
In one embodiment, the loading the sample integration embedded token array into the restoration mapping component of the initial voice markup network to perform target voice analysis, and obtaining the inference confidence levels of the plurality of inference voice markup sets includes: loading the sample integration embedded characterization array to the restoration mapping component to perform voice object range reasoning so as to obtain a voice object range reasoning result; the voice object range reasoning result comprises a second selected preset voice object in a plurality of preset voice objects; determining the second selected preset voice object, and corresponding second target voice object prototype characteristic parameters in a voice object prototype characteristic parameter set; performing similarity measurement on the sample integration embedded characterization array according to the prototype characteristic parameters of the second target voice object to obtain a target sample integration embedded characterization array corresponding to the second selected preset voice object; and loading the target sample integration embedded representation array to a target voice analysis module of the restoration mapping assembly to analyze target voice to obtain the reasoning confidence levels of a plurality of reasoning voice mark sets corresponding to the second selected preset voice object.
In one embodiment, the loading the sample integration embedded characterization array into the restoration mapping component to perform voice object range reasoning to obtain a voice object range reasoning result includes: extracting the embedded characterization array at the position corresponding to a preset classification identification code from the sample integration embedded characterization array to obtain a sample classification embedded characterization array, the preset classification identification code being an identification code placed at the head of the behavior interaction sample data when the sample behavior characterization array is generated; loading the sample classification embedded characterization array into a classification module of the restoration mapping component for mapping classification to obtain reasoning classification information, wherein the reasoning classification information includes a reasoning support coefficient corresponding to each preset voice object in the plurality of preset voice objects, and the reasoning support coefficient characterizes the confidence level that the corresponding preset voice object exists in the behavior interaction sample data; and determining, among the plurality of preset voice objects, a second selected preset voice object whose reasoning support coefficient is greater than a support coefficient threshold, so as to obtain the voice object range reasoning result.
In one embodiment, said determining an error value based on the inference confidence levels of the plurality of inference voice tag sets comprises: determining a first error value according to the reasoning confidence levels of a plurality of reasoning voice mark sets corresponding to the second selected preset voice object based on a first error determining function; determining a second error value based on a second error determining function according to the reasoning classification information and comparison classification priori information corresponding to the behavior interaction sample data, wherein the comparison classification priori information is used for indicating whether each preset voice object in the plurality of preset voice objects exists in the behavior interaction sample data; and determining a total error value according to the first error value and the second error value.
In one embodiment, the determining, based on the first error determining function, a first error value according to the inference confidence levels of the plurality of inference voice mark sets corresponding to the second selected preset voice object includes: when the second selected preset voice object matches the comparison classification prior information, taking maximization of the inference confidence level corresponding to the comparison voice mark set among the plurality of inference voice mark sets as a first direction, and obtaining a first error value based on the first error determining function according to the first direction and the inference confidence levels of the plurality of inference voice mark sets; and when the second selected preset voice object does not match the comparison classification prior information, taking maximization of the inference confidence level corresponding to a target inference voice mark set among the plurality of inference voice mark sets as a second direction, and obtaining a first error value based on the first error determining function according to the second direction and the inference confidence levels of the plurality of inference voice mark sets, wherein each voice object mark in the target inference voice mark set represents that it is not any preset voice object.
In a second aspect, the present application provides a computer system that is a learning electric toy or a background device in communication with a learning electric toy, the computer system comprising: one or more processors; a memory; one or more computer programs; wherein the one or more computer programs are stored in the memory and configured to be executed by the one or more processors, the one or more computer programs, when executed by the processors, implement the methods as described above.
The beneficial effects of the application at least include the following. According to the embodiments of the application, the core voice characterization array, i.e. the part of the voice characterization array that carries semantic content also reflected in the behavior characterization array, is determined according to the behavior characterization array; feature interaction is performed on the behavior characterization array and the core voice characterization array to obtain an integrated characterization array, which is embedded into a preset feature domain to obtain an integrated embedded characterization array. In this way, information such as emotion and mental state expressed in the voice data is aligned with the behavior characterization array carrying the same information, completing the feature integration. When target voice parsing is performed according to the integrated embedded characterization array and the behavior interaction data to be analyzed is voice-marked according to the target voice mark set, recognition reliability is improved, which in turn enables better interactive control and brings a better interaction experience to the user.
In the following description, further features will be set forth in part. Upon review of the ensuing disclosure and the accompanying figures, those skilled in the art will discover some of these features, or will be able to ascertain them through production or use. The features of the present application may be implemented and obtained by practicing or using the various aspects of the methods, tools, and combinations set forth in the detailed examples described below.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings that are required to be used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a flowchart of a learning type electric toy control method based on man-machine interaction according to an embodiment of the present application;
Fig. 2 is a schematic diagram of a computer system according to an embodiment of the application.
Detailed Description
Embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application. The terminology used in the description of the embodiments of the application is for the purpose of describing particular embodiments of the application only and is not intended to be limiting of the application.
The execution subject of the learning type electric toy control method based on man-machine interaction in the embodiments of the application is a computer system, including but not limited to a server, a personal computer, a notebook computer, a tablet computer, a smart phone and the like. Servers include, but are not limited to, a single web server, a server group consisting of multiple web servers, or a cloud of a large number of computers or web servers in cloud computing, where cloud computing is a type of distributed computing, a super virtual computer consisting of a collection of loosely coupled computers. The computer system may run alone to realize the application, or may access a network and realize the application through interaction with other computer systems in the network. The network in which the computer system is located includes, but is not limited to, the internet, a wide area network, a metropolitan area network, a local area network, a VPN network, and the like. It is understood that the computer system is communicatively connected to the learning type electric toy; in other words, the learning type electric toy has a networking function and can upload data to the computer system. It should be noted that the learning type electric toy obtains user-related data only with the user's permission, and data are collected lawfully within the scope of applicable laws and regulations.
The embodiment of the application provides a learning type electric toy control method based on man-machine interaction, which is applied to a computer system and, as shown in fig. 1, comprises the following steps. Step S110: acquiring a behavior characterization array of the behavior interaction data to be analyzed and a voice characterization array of the voice interaction data to be analyzed, wherein the behavior interaction data to be analyzed is the matching behavior data corresponding to the voice interaction data to be analyzed.
In step S110, a computer system (e.g., a learning electric toy or a background device in communication with the learning electric toy) acquires behavior interaction data between a user and the toy. The behavioral interaction data is generated during the interaction of the user with the toy and may include various behaviors of the user, such as expressions, actions, and the like. For example, when a user pushes the toy, the learning electric toy may capture data about the action, such as the force, speed, direction, etc. of the pushing. Likewise, when the user beats or follows the toy, corresponding behavior data is also generated.
In order to further analyze and process these behavioral interaction data, the computer system performs feature extraction on the behavioral interaction data to obtain a behavioral characterization array. Feature extraction is a common technique in machine learning that can extract meaningful information from raw data for subsequent model training and reasoning. In the embodiment of the application, the behavior data of the user can be converted into the form of the feature vector, so that the computer system can conveniently process and analyze the behavior data.
Meanwhile, the computer system also acquires voice interaction data between the user and the toy. The voice interaction data is the interactive speech between the user and the electric toy and may contain information such as the user's instructions, inquiries, and emotional expressions. For example, when the user says "toy, walk forward", this sentence is a piece of voice interaction data. Similar to the behavioral interaction data, the computer system also performs feature extraction on the voice interaction data to obtain a voice characterization array. The speech features may include acoustic features such as pitch, intensity, duration, and timbre, as well as linguistic features such as vocabulary, grammar, and semantics. These features reflect the user's voice characteristics and interactive intent and are an important basis for the subsequent voice markup.
It is noted that the behavioral interaction data and the voice interaction data to be analyzed are matched, i.e. they correspond to the user behavior at the same moment. This means that when a user gives a certain voice command, their behavior data is also recorded at the same time. This matching relationship is critical for the subsequent formulation of voice markup and interactive control strategies.
For example, suppose a child is playing with a learning type electric toy car: he pushes the toy car while saying "run fast". In this scenario, the behavior data of pushing the toy car and the voice data of "run fast" form a set of matched interaction data. The computer system extracts feature vectors from the two sets of data, namely a behavior characterization array and a voice characterization array, respectively, for subsequent analysis and processing.
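A purely illustrative sketch (not part of the claimed method) of how a matched behavior/voice pair could be packed into fixed-length characterization arrays. In the application itself this is done by the pre-debugged behavior coding module and voice processing network, so the hand-crafted features below (push force, speed, direction; energy, zero-crossing rate, duration) are stand-in assumptions only:

```python
import numpy as np

# Illustrative stand-ins for the pre-debugged behavior coding module and
# voice processing network mentioned in this application.

def behavior_characterization(push_force, push_speed, direction_deg):
    """Pack one matched behavior sample (e.g. pushing the toy car) into an array."""
    return np.array([push_force,
                     push_speed,
                     np.cos(np.radians(direction_deg)),
                     np.sin(np.radians(direction_deg))], dtype=np.float32)

def speech_characterization(waveform, sample_rate=16000):
    """Very rough acoustic features: energy, zero-crossing rate, duration."""
    energy = float(np.mean(waveform ** 2))
    zcr = float(np.mean(np.abs(np.diff(np.sign(waveform)))) / 2.0)
    duration = len(waveform) / sample_rate
    return np.array([energy, zcr, duration], dtype=np.float32)

# Matched pair: the child pushes the car while saying "run fast".
behavior_arr = behavior_characterization(push_force=0.8, push_speed=1.2, direction_deg=0.0)
speech_arr = speech_characterization(np.random.randn(8000).astype(np.float32))
print(behavior_arr, speech_arr)
```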
Step S120: and determining a core voice characterization array in the voice characterization arrays according to the behavior characterization arrays, wherein the core voice characterization array is the voice characterization array with semantic characterization involved in the behavior characterization arrays.
Step S120 involves screening the core speech characterization arrays associated with the behavior characterization arrays from the speech characterization arrays. The purpose is to determine which speech features are closely related to user behavior, thereby providing a more accurate data basis for subsequent voice tag parsing and control strategy formulation.
Specifically, the computer system analyzes and processes the speech characterization array in this step based on the behavior characterization array that has been acquired. The behavior characterization array is a characterization representation of user behavior data that includes a plurality of characteristic dimensions of the user behavior, such as action type, action frequency, expression change, and the like. These features reflect the behavior characteristics and patterns of the user during interaction with the toy.
The computer system establishes an association between the behavior characterization array and the speech characterization array using a machine learning algorithm or model, such as a neural network in deep learning. Such an association can be established by training a classifier or regressor with the behavior characterization array as input and the speech characterization array as output or label. By training and optimizing this model, the computer system can learn which speech features are closely related to user behavior. Once such associations are established, the computer system may screen out the core speech characterization array from the speech characterization array based on the behavior characterization array. The core speech characterization array refers to the speech characterization array having semantic characterizations involved in the behavior characterization array, that is, speech features that are semantically associated with the user behavior. For example, when the user speaks the "forward" instruction, the matching action may be pushing the toy vehicle forward, in which case the speech feature of the word "forward" belongs to the core speech characterization array. Or, when the user says "I am so sleepy, I want to sleep", the matched behavior may be a tired expression on the user's face (expression and action recognition can be implemented with general-purpose algorithms), in which case the speech feature of the word "sleepy" belongs to the core speech characterization array.
To illustrate this process more specifically, in one example, assume a child is playing a learning electric toy vehicle, he speaks the word "sprint" and simultaneously pushes the toy vehicle. The computer system extracts the feature vector of the action and the speech feature vector of the word "sprint", respectively. Then, with the model already trained, the computer system may determine that the speech feature vector of the word "sprint" is semantically related to the behavior feature vector of the toy vehicle being propelled. Thus, the speech feature vector of the word "sprint" is selected as part of the core speech characterization array.
The association relation between the behavior representation array and the voice representation array is established by utilizing a machine learning algorithm and a model, and the core voice representation array is screened out according to the association relation. This step provides a more accurate and targeted data basis for subsequent voice tag parsing and control strategy formulation.
In one embodiment, in step S120, determining a core speech characterization array from the speech characterization arrays according to the behavior characterization array may specifically include: step S121: the cross-attention impact coefficient is determined from the behavior characterization array and the speech characterization array.
In step S121, a cross-attention influence coefficient, i.e. a weight value, is determined from the behavior characterization array and the speech characterization array; this mechanism is known as cross attention. The cross-attention influence coefficient reflects the correlation and importance between the behavior data and the voice data and is the basis for the subsequent correction of the speech characterization array. To determine the cross-attention influence coefficient, the computer system may employ an attention mechanism from machine learning. The attention mechanism is a technique that simulates human visual attention: it allows the model, when processing information, to place more attention on the parts relevant to the current task while ignoring the parts that are not. In this scenario, the attention mechanism helps the computer system find the portion of the speech characterization array associated with the behavior characterization array.
In particular, the computer system may construct a cross-attention model whose inputs are the behavior characterization array and the speech characterization array and whose outputs are the cross-attention influence coefficients. The model may be a neural network, such as a multi-layer perceptron (MLP), convolutional Neural Network (CNN), or Recurrent Neural Network (RNN), etc. In the model training process, the computer system learns a cross-attention mapping function according to the corresponding relation between the behavior characterization array and the voice characterization array, and the function can calculate the importance or the relevance of each voice feature to the behavior feature. For example, assume that the behavior characterization array includes features such as a user's force, speed, and direction to propel the toy vehicle, while the speech characterization array includes speech features for instructions such as "fast run", "slow point", "left turn", etc., spoken by the user. The computer system will use the cross-attention model to calculate the importance or relevance of each speech feature to the behavioral features. For example, the voice characteristics of the instruction "fast running" may be highly correlated to the dynamics and speed characteristics of the propelled toy vehicle, so that the cross-attention impact coefficient therebetween may be relatively large. By determining the cross-attention impact coefficients, the computer system can learn which speech features are closely related to user behavior, thereby providing a basis for subsequent speech characterization array corrections. In step S122, the computer system corrects the speech characterization array according to the cross-attention impact coefficients to obtain a core speech characterization array. The method for correcting can be to perform operations such as weighted summation, feature selection or feature fusion on the original voice representation array, so that the corrected voice representation array is more in accordance with the characteristics and modes of user behaviors.
Step S122: and correcting the voice characterization array according to the cross-attention influence coefficient to obtain a core voice characterization array.
The purpose of step S122 is to screen out the speech features most relevant to the user behavior, enabling more accurate subsequent voice tag parsing and control instruction generation. Specifically, the computer system adjusts each element in the speech characterization array according to the cross-attention influence coefficients. The cross-attention influence coefficient reflects the degree of correlation between the speech characterization and the behavior characterization: a larger coefficient indicates that the corresponding speech feature is closely related to the current behavior, and a smaller one indicates weaker relevance. Thus, during the correction, the computer system increases the weight of the speech characterization elements with larger coefficients and decreases or eliminates the weight of the elements with smaller coefficients. One way to modify the speech characterization array is weighted averaging: multiply each speech characterization element by its corresponding cross-attention influence coefficient and then combine the results to obtain a weighted speech characterization array. In this way, speech features that are more related to the behavior occupy a greater weight in the corrected array.
For example, assume the speech characterization array contains speech features of a plurality of words spoken by the user, such as "forward", "backward", "left turn", and "right turn". Step S121 calculates a cross-attention influence coefficient between the speech feature of each word and the current behavior characterization array. In step S122, the computer system adjusts the weight of each word's speech feature in the array based on these coefficients. If the user's current behavior is pushing the toy vehicle forward, the speech feature of the word "forward" may be given a higher weight because it is highly correlated with the current behavior, while words such as "backward" and "left turn" will be given relatively low weights. After correction in this way, the resulting core speech characterization array focuses on the speech features closely related to the current behavior and voice mark, which helps improve the accuracy of subsequent behavior analysis and control. This is critical for achieving the autonomous behavior and interactive capabilities of the intelligent learning type electric toy.
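A minimal numerical sketch of steps S121-S122, assuming a single-head scaled dot-product form of cross attention with the behavior characterizations as queries and the speech characterizations as keys and values; the projection matrices are random placeholders here, whereas in the method they would be learned during network debugging:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy shapes: 4 behavior tokens and 6 speech tokens, each 8-dimensional.
behavior_arr = rng.normal(size=(4, 8))   # behavior characterization array
speech_arr = rng.normal(size=(6, 8))     # speech characterization array

# Random stand-ins for learned projection matrices.
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))

# Step S121: cross-attention influence coefficients between behavior (query)
# and speech (key) characterizations.
Q, K, V = behavior_arr @ W_q, speech_arr @ W_k, speech_arr @ W_v
cross_coeff = softmax(Q @ K.T / np.sqrt(8.0), axis=-1)     # shape (4, 6)

# Step S122: correct the speech characterization array with the coefficients,
# so speech features relevant to the current behavior receive larger weight.
core_speech_arr = cross_coeff @ V                          # shape (4, 8)
print(cross_coeff.round(2))
```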
Step S130: and carrying out feature interaction according to the behavior representation array and the core voice representation array to obtain an integrated representation array, and embedding the integrated representation array into a preset feature domain to obtain an integrated embedded representation array.
Step S130 involves feature interaction and integration of the behavior token array and the core speech token array, and embedding the integrated token array into a predetermined feature domain. The method aims at fusing the behavior and voice information of the user into a unified characterization space so as to facilitate subsequent voice mark analysis and control strategy formulation.
Specifically, the computer system first performs feature interaction between the behavior characterization array and the core speech characterization array. The feature interaction may take a variety of forms, such as splicing, weighted summation, or convolutional fusion. Taking splicing as an example, the computer system can splice the behavior characterization array and the core speech characterization array along a certain dimension to form a new characterization array, namely the integrated characterization array. The integrated characterization array contains both the behavior features and the speech features of the user and realizes an organic fusion of the two kinds of information.
Next, the computer system embeds the integrated characterization array into a preset feature domain. The preset feature domain is a high-level semantic space that can map lower-level feature information to higher-level semantic concepts. The embedding may be performed by a trained encoder, which may be a neural network model such as an autoencoder or a convolutional neural network. Through training, the encoder learns the best way to map the integrated characterization array into the preset feature domain. Taking an autoencoder as an example, the computer system uses a large amount of sample data to train it. The autoencoder consists of an encoder and a decoder: the encoder compresses the input data (i.e. the integrated characterization array) into a low-dimensional embedded vector, and the decoder restores the embedded vector to the original data. During training, the computer system continually optimizes the parameters of the encoder and decoder so that the difference between the restored data and the original data is minimized. After training, the encoder part can be used to embed the integrated characterization array into the preset feature domain. The embedded integrated characterization array is referred to as the integrated embedded characterization array; it contains a unified characterization of the user's behavior and speech information in a high-level semantic space. This array is used in the subsequent voice markup parsing and control strategy formulation to achieve a more intelligent, natural human-machine interaction experience. For example, in the application scenario of a learning type electric toy vehicle, the integrated embedded characterization array obtained in step S130 may be used to determine whether the user intends to make the toy vehicle advance, retreat, turn, etc., so as to issue the corresponding control command.
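A sketch of step S130 under simplifying assumptions: plain splicing for the feature interaction, and a single random linear-plus-tanh layer standing in for the encoder half of the trained autoencoder:

```python
import numpy as np

rng = np.random.default_rng(1)

behavior_arr = rng.normal(size=(4, 8))      # behavior characterization array
core_speech_arr = rng.normal(size=(4, 8))   # core speech characterization array (step S120)

# Feature interaction by splicing along the feature dimension.
integrated_arr = np.concatenate([behavior_arr, core_speech_arr], axis=-1)    # (4, 16)

# Embedding into the preset feature domain: a single linear layer stands in for
# the encoder half of a trained autoencoder; weights are random for illustration.
W_embed = rng.normal(size=(16, 12)) / np.sqrt(16)
b_embed = np.zeros(12)
integrated_embedded_arr = np.tanh(integrated_arr @ W_embed + b_embed)        # (4, 12)

print(integrated_arr.shape, integrated_embedded_arr.shape)
```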
In one embodiment, in step S130, performing feature interaction according to the behavior token array and the core speech token array to obtain an integrated token array may specifically include: step S131: an internal attention impact coefficient is determined from the behavior characterization array.
In step S131, the computer system determines an internal attention influence coefficient, i.e. a weight value, from the behavior characterization array; this mechanism is known as self attention. The internal attention influence coefficients reflect the correlation and importance between different features in the behavior characterization array, helping the computer system understand and parse the user's speech content more accurately.
To determine the internal attention influence coefficient, the computer system may employ a self-attention mechanism, an attention technique widely used in the field of deep learning. The self-attention mechanism allows the model to focus on correlations between different positions within a sequence as it processes sequence data. In this scenario, the behavior characterization array may be considered a sequence in which each element represents a behavior feature. Through the self-attention mechanism, the computer system can calculate the importance or relevance of each behavioral feature to the overall characterization. Specifically, the computer system may construct a self-attention model whose input is the behavior characterization array and whose output is the internal attention influence coefficients. This model may be a neural network, such as the self-attention layer in the Transformer model. During model training, the computer system learns a self-attention mapping function based on the relationships between features in the behavior characterization array, and this function can calculate the self-attention weight of each feature. For example, assume the behavior characterization array contains features such as the force, speed, direction, and time interval with which the user pushes the toy vehicle. The computer system uses the self-attention model to calculate the correlations between these features. The force and speed features may be highly correlated, because a user will typically adjust force and speed simultaneously while pushing the toy vehicle, so the internal attention influence coefficient between these two features will be relatively large. Conversely, the direction and time interval features may have a low correlation with the other features, so their internal attention influence coefficients may be relatively small.
By determining the internal attention impact coefficients, the computer system can learn which behavioral characteristics are interrelated and which characteristics are relatively independent. This information is important for subsequent voice tag parsing and control strategy formulation. In step S132, the computer system corrects the behavior characterization array according to the internal attention impact coefficients to obtain a core behavior characterization array. The method for correcting can be to perform operations such as weighted summation, feature selection or feature fusion on the original behavior representation array, so that the corrected behavior representation array is more in line with the characteristics and modes of the user behavior.
Finally, in step S133, the computer system performs feature interaction (i.e. performs feature fusion) on the behavior characterization array, the core speech characterization array, and the core behavior characterization array to obtain an integrated characterization array. The integrated token array contains both behavioral and speech characteristics of the user and considers the correlation between behavioral characteristics. Such an integrated token array can provide a more comprehensive and accurate information basis for subsequent voice tag parsing and control strategy formulation.
Step S132: and correcting the behavior characterization array according to the internal attention influence coefficient to obtain a core behavior characterization array.
In step S132, the original behavior characterization array is adjusted and optimized according to the internal attention impact coefficient calculated in step S131. The internal attention impact coefficients, also referred to as self-attention weights, reflect the correlation and importance between different features in the behavior characterization array. These coefficients are calculated by a self-attention mechanism that allows the model to focus on correlations between different locations within the sequence as it processes the sequence data. In this scenario, the behavior representation array may be considered a sequence, where each element represents a behavior feature. The purpose of modifying the behavior characterization array is to screen out features most relevant to the user behavior for later more accurate voice tag parsing and control instruction generation. The correction method can be implemented by means of weighted average, feature selection or feature fusion. Specifically, the computer system adjusts each element in the behavior characterization array based on the internal attention-influencing coefficients. For the features with larger weights, the computer system increases the specific gravity of the features in the corrected array; for the features with smaller weight, the computer system reduces or eliminates the weight of the features in the corrected array.
Taking a weighted average as an example, assume that the behavior characterization array contains characteristics of the strength, speed, direction, etc. of the user pushing the toy vehicle, and that the internal attention impact coefficient of each characteristic has been calculated by a self-attention mechanism. In step S132, the computer system multiplies the value of each feature by its corresponding internal attention impact coefficient and then sums the results to obtain a weighted array of behavior characterizations. Thus, features that are more relevant to user behavior will take up a greater weight in the revised array. Through the correction of step S132, the obtained core behavior characterization array is more focused on the features closely related to the user behavior, which is helpful for improving the accuracy of subsequent behavior analysis and control. This is critical to achieving autonomous behavior and interactive capabilities of the intelligent learning electric toy.
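A minimal sketch of steps S131-S132 using single-head self attention over the behavior features; the projections are random placeholders for the trained self-attention model, and the coefficient-times-value product plays the role of the weighted correction described above:

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Four behavior features (e.g. force, speed, direction, time interval),
# each described by an 8-dimensional characterization.
behavior_arr = rng.normal(size=(4, 8))

W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))

# Step S131: internal (self-)attention influence coefficients among behavior features.
Q, K, V = behavior_arr @ W_q, behavior_arr @ W_k, behavior_arr @ W_v
self_coeff = softmax(Q @ K.T / np.sqrt(8.0), axis=-1)      # (4, 4)

# Step S132: weighted correction of the behavior characterization array, so that
# features strongly related to the others keep a larger share.
core_behavior_arr = self_coeff @ V                         # (4, 8)
print(self_coeff.round(2))
```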
It should be noted that in practical applications, the method for modifying the behavior characterization array may be selected and designed according to specific requirements and situations. Other methods besides weighted averaging, such as feature selection, feature fusion, etc., may be used for correction. Meanwhile, the calculation of the internal attention influence coefficient can also be realized by adopting different self-attention models or algorithms.
Step S133: and performing feature interaction on the behavior characterization array, the core voice characterization array and the core behavior characterization array to obtain an integrated characterization array.
Step S133 is a key step of performing feature interaction (feature fusion) on the behavior characterization array, the core voice characterization array and the core behavior characterization array to obtain an integrated characterization array in the learning type electric toy control method. The method aims at integrating the behavior and voice information of the user and the relevance between the behavior and the voice information into a unified characterization space, and providing a comprehensive information basis for subsequent voice mark analysis and control strategy formulation.
Feature interactions, also known as feature fusion, refer to the combining and integration of features from different sources or different levels to form a new, more expressive feature set. In this scenario, the behavior characterization array reflects the behavior characteristics of the user, the core speech characterization array reflects the speech characteristics of the user, and the core behavior characterization array is corrected to be more focused on the feature set closely related to the user behavior.
In step S133, the computer system may employ a particular algorithm or model to implement the feature interactions. These algorithms or models may be machine learning based methods such as deep neural networks, convolutional neural networks, recurrent neural networks, and the like. The specific choice of which algorithm or model depends on the nature of the data and the requirements of the task.
Taking a deep neural network as an example, the computer system can design a multi-layer neural network structure, takes the behavior characterization array, the core voice characterization array and the core behavior characterization array as input, and obtains a fused characterization vector through multi-layer nonlinear transformation and feature extraction. The fused token vector contains both the behavior information and the voice information of the user and considers the relevance between the behavior information and the voice information of the user.
In practical applications, the specific implementation manner of the feature interaction can be designed and adjusted according to specific requirements and scenes. In addition to simple stitching or weighted averaging, more complex feature interaction methods may be employed, such as attention-mechanism-based feature fusion, graph-model-based feature fusion, and the like. The method can better capture the interaction and the dependency relationship between different features, and improve the expression capacity and the generalization performance of the integrated characterization array. Through the feature interaction in step S133, the obtained integrated characterization array provides a more comprehensive and accurate information basis for the voice content analysis and control strategy formulation of the user. This is critical to achieving autonomous behavior and interactive capabilities of the intelligent learning electric toy.
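One deliberately simple way to realize the feature interaction of step S133, shown only as a sketch: splice the behavior, core speech, and core behavior characterization arrays and pass the result through a small two-layer network with random placeholder weights; an attention- or graph-based fusion module, as mentioned above, could replace this in a richer implementation:

```python
import numpy as np

rng = np.random.default_rng(3)

def relu(x):
    return np.maximum(x, 0.0)

behavior_arr = rng.normal(size=(4, 8))
core_speech_arr = rng.normal(size=(4, 8))
core_behavior_arr = rng.normal(size=(4, 8))

# Splice the three characterization arrays along the feature dimension.
fused_in = np.concatenate([behavior_arr, core_speech_arr, core_behavior_arr], axis=-1)  # (4, 24)

# Two-layer MLP with random weights as a stand-in for the trained fusion network.
W1, b1 = rng.normal(size=(24, 16)) / np.sqrt(24), np.zeros(16)
W2, b2 = rng.normal(size=(16, 12)) / np.sqrt(16), np.zeros(12)
integrated_arr = relu(fused_in @ W1 + b1) @ W2 + b2                                     # (4, 12)

print(integrated_arr.shape)
```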
Step S140: and analyzing the target voice according to the integrated embedded token array to obtain a target voice mark set.
In step S140, the computer system analyzes the user's voice content by applying a specific algorithm or model to the integrated embedded characterization array obtained in the previous steps, and marks the voice content as a target voice tag set. There may be multiple target voice tag sets, and each tag may be a structured datum, for example presented in a "voice object tag-voice object" format. The tags may cover emotions and mental states (in other scenarios they may also cover learning states such as confused or understood), helping the computer system understand and recognize the user's behavior more clearly. Examples are "mood-haha (representing happiness)" and "mental state-dash (representing excitement)", although the marking may also be implemented by other means, such as numerical values or letters.
To achieve parsing of the user's voice tags, the computer system may employ various machine learning models, such as classifiers, clustering models, and sequence models. These models learn and recognize the user's speech content based on the features in the integrated embedded characterization array. Specifically, the models may infer the user's behavioral intent and habits by analyzing the relationships and patterns between the elements of the integrated embedded characterization array. Taking the classifier as an example, the computer system may train a classification model, such as a support vector machine (SVM), a decision tree, or a deep learning model, to identify different user voice tags. During training, the classifier learns how to classify user behavior into different categories based on the features in the integrated embedded characterization array. Once the model training is complete, it can be used to parse new user behavior data and mark it with the corresponding voice tags.
In addition, sequence models such as Recurrent Neural Networks (RNNs) or long short term memory networks (LSTM) are also suitable for parsing user behavior data with timing dependencies. These models may capture temporal dynamics and patterns in the sequence of user behaviors to more accurately identify the user's voice content.
It should be noted that the specific form and content of the set of voice tags depends on the application scenario and task requirements. For example, in an electronic toy control scenario, the set of voice markers may be a sequence of data including markers of the user's emotion, mental state, game preferences, etc. These indicia help the computer system better understand the user's needs and formulate personalized control strategies based thereon. Through the target voice parsing of step S140, the computer system may gain a deep understanding about the user' S behavior and provide powerful support for subsequent control instruction generation. This makes the electronic toy of study type can respond to user's operation and demand more intelligently, promotes user experience and interactivity.
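For illustration only, a toy parser that turns an integrated embedded characterization array into a target voice tag set of structured "voice object tag-voice object" marks; the tag vocabulary, the mean pooling, and the random linear classifier are assumptions rather than the trained markup network:

```python
import numpy as np

rng = np.random.default_rng(4)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical tag vocabulary in "voice object tag-voice object" form.
TAGS = ["emotion-happy", "emotion-excited", "mental state-tired", "mental state-sleepy"]

# Pool the integrated embedded characterization array (4 tokens x 12 dims here)
# into one vector, then score each candidate tag with a random linear classifier.
integrated_embedded_arr = rng.normal(size=(4, 12))
pooled = integrated_embedded_arr.mean(axis=0)

W, b = rng.normal(size=(12, len(TAGS))), np.zeros(len(TAGS))
probs = softmax(pooled @ W + b)

# Keep every tag whose probability clears a threshold to form the target voice tag set.
target_voice_tag_set = [t for t, p in zip(TAGS, probs) if p > 0.25]
print(dict(zip(TAGS, probs.round(2))))
print(target_voice_tag_set)
```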
In step S140, as an implementation manner, the target voice analysis is performed according to the integrated embedded token array to obtain a target voice tag set, which may specifically include: step S141: and determining a first selected preset voice object in the plurality of preset voice objects according to the integrated embedded representation array, wherein the first selected preset voice object is a preset voice object in the behavior interaction data to be analyzed.
In step S141, the computer system determines a first selected preset voice object of the plurality of preset voice objects according to the integrated embedded token array. This first selected preset speech object is a key speech object in the behavioral interaction data to be analyzed, which is crucial for subsequent target speech parsing.
The preset voice objects are a series of predefined features or labels used to describe and classify the user's behavior. The preset voice objects may include emotion-type voice objects (e.g., the voice objects corresponding to words such as "dash", "ha", "sprint", and "good stick" are emotion-type voice objects) and mental-state-type voice objects (e.g., the voice objects corresponding to words such as "eujer", "continue", "tired", and "sleepy" are mental-state-type voice objects).
In one embodiment, in step S141, determining a first selected preset voice object from the plurality of preset voice objects according to the integrated embedded characterization array may specifically include: Step S1411: extracting the embedded characterization array at the position corresponding to a preset classification identification code from the integrated embedded characterization array to obtain a classification embedded characterization array, wherein the preset classification identification code is an identification code placed at the head of the behavior interaction data to be analyzed when the behavior characterization array is generated.
The main task of step S1411 is to extract the embedded token array corresponding to the preset classification identifier from the integrated embedded token array to obtain the classified embedded token array. Firstly, it needs to be clear what is the integrated embedded token array, which is obtained in step S133, and fuses the information of the user' S behavior token array, the core speech token array and the core behavior token array. The array contains rich user behavior characteristics and is the basis for target voice analysis.
In the embodiment of the application, the integrated embedded characterization array is a multidimensional array obtained after processing user behavior interaction data through a certain algorithm or model, contains various characteristic information of user behaviors, and is the basis of subsequent analysis and decision. And the preset classification identifier is an identifier which is arranged in the head of the behavior interaction data to be analyzed when generating the behavior representation array. The function of this identification code is to identify and locate information related to a particular category. In the integrated embedded token array, the embedded token array corresponding to the preset classification identification code, namely the classified embedded token array, comprises characteristic information related to the class.
Therefore, in the embodiment of step S1411, the computer system searches and locates within the integrated embedded characterization array according to the preset classification identification code, and extracts the embedded characterization array corresponding to that identification code. This process can be likened to retrieving information from a database using a particular keyword. In this way, the computer system is able to obtain the user behavior feature information associated with a particular class, providing data support for subsequent mapping classification and voice object selection.
Step S1412: mapping classification is carried out according to the classification embedded characterization array to obtain classification information, wherein the classification information comprises a support coefficient corresponding to each preset voice object in the plurality of preset voice objects, and the support coefficient characterizes the confidence level of the corresponding preset voice object in the interaction data of the behavior to be analyzed.
In step S1412, the computer system uses a specific mapping method, based on the classification embedded characterization array, to associate each preset voice object with its confidence level in the behavioral interaction data.
Mapping classification is a technique that maps input data to particular classes. Here, the input data is the classification embedded characterization array, and the output is a support coefficient corresponding to each preset voice object. The support coefficient is a value that indicates the confidence level that the corresponding preset voice object exists in the behavioral interaction data to be analyzed. In other words, the higher the support coefficient, the greater the likelihood that the preset voice object appears in the behavioral interaction data. To implement the mapping classification, the computer system may employ a linear mapping approach. Linear mapping is a simple mapping method that multiplies the input data by a weight matrix and adds an offset term to obtain the output data. In this scenario, the computer system maps the channel dimension of the classification embedded characterization array to the number of preset voice objects. The channel dimension is an important feature of the embedded characterization array that represents different aspects of the user's behavior information. Through the linear mapping, the computer system can correlate these channel dimensions with the preset voice objects. In addition, to limit the mapping results to a reasonable range, the computer system may also use a Sigmoid (logistic) function, a commonly used activation function that maps input values into the interval from 0 to 1. In this scenario, the Sigmoid function converts the result of the linear mapping into support coefficients. After this processing, each preset voice object obtains a support coefficient between 0 and 1, which represents the confidence level that the voice object exists in the behavioral interaction data.
For example, suppose that the classification embedded characterization array is a vector [0.5, 0.8] with 2 channel dimensions, and that there are 2 preset voice objects (the number in practice is much larger): "happy" and "tired". The computer system maps this vector into two support coefficients through the linear mapping, and then converts these two values into support coefficients between 0.0 and 1.0 using the Sigmoid function. The resulting classification information might be: the support coefficient for "happy" is 0.6, and the support coefficient for "tired" is 0.7. This information indicates that "tired" has the highest confidence level of occurring in the behavioral interaction data to be analyzed, followed by "happy".
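The following is a minimal, illustrative sketch of this linear-mapping-plus-Sigmoid computation. The weights, bias and object names are hypothetical values chosen so that the result is close to the 0.6 and 0.7 of the example above; this is not the patented model itself.

```python
import numpy as np

# Hypothetical preset voice objects matching the example above.
PRESET_VOICE_OBJECTS = ["happy", "tired"]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mapping_classification(class_embedding, weights, bias):
    """Linearly map the channel dimension to the number of preset voice objects,
    then squash each logit into a (0, 1) support coefficient with the Sigmoid."""
    logits = class_embedding @ weights + bias      # shape: (num_objects,)
    return sigmoid(logits)

# Toy values only for illustration; real weights would be learned during debugging.
class_embedding = np.array([0.5, 0.8])             # classification embedded characterization array
weights = np.array([[0.4, -0.2],                   # shape: (channel_dim, num_objects)
                    [0.1,  1.2]])
bias = np.array([0.1, 0.0])

support = mapping_classification(class_embedding, weights, bias)
for name, coeff in zip(PRESET_VOICE_OBJECTS, support):
    print(f"{name}: support coefficient = {coeff:.2f}")   # roughly 0.59 and 0.70
```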
Step S1413: and determining a first selected preset voice object with a support coefficient larger than a support coefficient threshold value in the plurality of preset voice objects.
In step S1413, the computer system makes a determination according to the support coefficients calculated in the previous step S1412. The support coefficient is a value between 0 and 1, indicating the confidence level of the presence of the corresponding preset voice object in the behavioral interaction data to be analyzed. The higher the support coefficient, the greater the likelihood that the preset voice object appears in the behavioral interaction data. To determine the first selected preset voice object, the computer system sets a support coefficient threshold. The threshold is a fixed value used to screen out the preset voice objects with higher support coefficients. Only if the support coefficient of a preset voice object is larger than this threshold will it be selected as a first selected preset voice object. For example, assume that there are three preset voice objects A, B and C whose support coefficients are 0.8, 0.5 and 0.2, respectively. If the support coefficient threshold is set to 0.6, then only the support coefficient of voice object A is greater than this threshold, so voice object A will be selected as the first selected preset voice object. The support coefficient threshold is chosen according to the specific scenario and requirements and is not particularly limited here.
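A minimal sketch of this threshold-based selection (step S1413), reusing the example values above; the threshold 0.6 is only illustrative.

```python
SUPPORT_THRESHOLD = 0.6   # example value; chosen per scenario in practice

def select_first_preset_voice_objects(support_by_object, threshold=SUPPORT_THRESHOLD):
    """Return every preset voice object whose support coefficient exceeds the threshold."""
    return [name for name, coeff in support_by_object.items() if coeff > threshold]

support_by_object = {"A": 0.8, "B": 0.5, "C": 0.2}
print(select_first_preset_voice_objects(support_by_object))   # ['A']
```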
Step S142: determining a first target voice object prototype feature parameter corresponding to a first selected preset voice object in a voice object prototype feature parameter set, wherein the voice object prototype feature parameter set comprises voice object prototype feature parameters of each preset voice object in a plurality of preset voice objects.
In step S142, the computer system needs to determine a first target speech object prototype feature parameter corresponding to the first selected preset speech object in the speech object prototype feature parameter set. These parameters are the basis for subsequent similarity metrics and target speech parsing. The speech object prototype feature parameter set is a set comprising a plurality of preset speech objects and their corresponding feature parameters. Each preset speech object has a set of prototype feature parameters associated with it, which are represented in the form of vectors for describing the different features of the speech object. For example, in the case of an electric toy, a "sprint" is a preset speech object whose corresponding prototype feature parameters are set in advance.
In step S142, the computer system finds prototype feature parameters corresponding to the first selected preset voice object from the prototype feature parameter set of the voice object according to the first selected preset voice object determined in step S141. This process may be implemented by searching, matching or indexing, depending on the organization and storage of the prototype feature parameter set of the speech object. Once the prototype feature parameters of the first target speech object corresponding to the first selected preset speech object are found, the computer system may use these parameters for subsequent similarity measurement and target speech analysis. These prototype feature parameters provide a reference standard for a computer system to accurately measure the similarity or difference between user behavior and a pre-set speech object.
It should be noted that the parameters in the speech object prototype feature parameter set are learnable, meaning that during use of the computer system, the parameters can be updated and optimized based on actual data and user feedback. By constantly learning and adjusting prototype feature parameters, the computer system can gradually increase the accuracy and efficiency of target speech analysis. The computer system provides the necessary basis and reference for subsequent similarity measurement and target speech parsing by determining the prototype feature parameters of the first target speech object corresponding to the first selected preset speech object.
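As a hedged sketch only (the embodiment does not specify a storage form), the learnable voice object prototype feature parameter set could be kept as a trainable embedding table with one vector per preset voice object; the object names and dimension below are hypothetical.

```python
import torch
import torch.nn as nn

# Hypothetical preset voice objects and prototype dimension.
PRESET_VOICE_OBJECTS = ["dash", "ha", "sprint", "tired"]
FEATURE_DIM = 16

# One learnable prototype vector per preset voice object; updated during training.
prototype_table = nn.Embedding(num_embeddings=len(PRESET_VOICE_OBJECTS),
                               embedding_dim=FEATURE_DIM)

def lookup_prototype(object_name: str) -> torch.Tensor:
    """Step S142: find the prototype feature parameters of the selected voice object."""
    index = torch.tensor(PRESET_VOICE_OBJECTS.index(object_name))
    return prototype_table(index)                  # shape: (FEATURE_DIM,)

first_target_prototype = lookup_prototype("sprint")
print(first_target_prototype.shape)                # torch.Size([16])
```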
Step S143: and carrying out similarity measurement on the integrated embedded characterization array according to the prototype characteristic parameters of the first target voice object to obtain a target integrated embedded characterization array corresponding to the first selected preset voice object.
In step S143, the computer system performs similarity measurement on the integrated embedded characterization array using the prototype feature parameters of the first target voice object to obtain a target integrated embedded characterization array highly related to the first selected preset voice object. A similarity measure is a process of calculating the degree of similarity between two objects. The computer system needs to compare the similarity between the elements of the integrated embedded characterization array and the prototype feature parameters of the first target voice object. To achieve this, various similarity measurement algorithms may be employed, such as cosine similarity, Euclidean distance, Pearson correlation coefficient, and the like. Taking cosine similarity as an example, the computer system may treat both the integrated embedded characterization array and the first target voice object prototype feature parameters as vectors in a vector space. The similarity between these two vectors is then measured by calculating the cosine of the angle between them. Cosine similarity has a value range of [-1, 1]; a value closer to 1 indicates that the two vectors are more similar, and a value closer to -1 indicates that they are less similar.
In practical application, the computer system performs similarity measurement on each element in the integrated embedded token array and the prototype feature parameters of the first target voice object, and screens out elements highly related to the first selected preset voice object according to the measurement result. The selected elements form a target integration embedded characterization array corresponding to the first selected preset voice object. It should be noted that the similarity measurement process in step S143 may be adjusted and optimized according to the specific application scenario and requirement. For example, different similarity metric algorithms may be employed, different similarity thresholds set, other features or constraints introduced, and so forth. These adjustments and optimizations help to improve the accuracy and efficiency of target speech parsing.
By using the prototype feature parameters of the first target voice object to perform similarity measurement on the integrated embedded characterization array, the computer system obtains a target integrated embedded characterization array highly related to the first selected preset voice object. This provides an important data basis and support for subsequent target voice parsing.
In step S143, as an implementation manner, the similarity measurement is performed on the integrated embedded token array according to the prototype feature parameters of the first target voice object to obtain a target integrated embedded token array corresponding to the first selected preset voice object, which may specifically include: step S1431: and determining a similarity measurement result between each embedded characterization array in the integrated embedded characterization arrays and the prototype characteristic parameters of the first target voice object.
In step S1431, the computer system measures the similarity between each of the integrated embedded token arrays and the prototype feature parameters of the first target speech object. Such similarity measures are techniques commonly used in machine learning to quantify the degree of similarity between two data points or feature sets. Specifically, the computer system uses a similarity measurement algorithm, such as cosine similarity, euclidean distance, manhattan distance, etc., to calculate the similarity between each embedded token array and the prototype feature parameters of the first target speech object. These algorithms calculate the angle, distance or degree of overlap between the two vectors based on the values in the feature vectors, and thus derive their similarity.
Taking cosine similarity as an example, it measures the cosine value of the included angle of two vectors in the multidimensional space, and the closer the value is to 1, the more similar the two vector directions are, namely the more similar the two data points are. The computer system calculates cosine similarity between the feature vector of each embedded token array in the integrated embedded token array and the feature vector of the prototype feature parameter of the first target speech object.
Assume that there are three embedded token arrays A, B, C in the integrated embedded token array, each of which is a multi-dimensional feature vector. The prototype feature parameter of the first target speech object is also a multidimensional feature vector, denoted P. The computer system calculates cosine similarities between A and P, B and P, C and P, respectively, to obtain three similarity measurement results.
Step S1432: multiplying the similarity measurement result by the embedded characterization array, and splicing the multiplication result and the embedded characterization array to obtain the target integrated embedded characterization array.
In step S1432, first, the computer system acquires the similarity measurement result calculated in step S1431. These results quantify the degree of similarity between each of the integrated embedded token arrays and the first target speech object prototype feature parameters. The similarity measure is typically a number or vector whose size reflects the strength of the similarity. Next, the computer system multiplies each embedded token array with its corresponding similarity metric result. This multiplication operation can be understood as a weighting process on the embedded token array, with the similarity metric result as a weight. If a certain embedded token array is very similar to the prototype feature parameters of the first target speech object, the similarity measurement result will be a larger value, and the contribution of the embedded token array in the target integrated embedded token array will be larger after multiplication.
For example, assume that there are three embedded token arrays A, B, C in the integrated embedded token array, and that the similarity metrics with the first target speech object prototype feature parameters result in 0.9, 0.5, and 0.7, respectively. The computer system multiplies the A array by 0.9 of the similarity measurement result to obtain a weighted A array; similarly, the B-array and the C-array are multiplied by their corresponding similarity metrics to obtain weighted B-array and C-array.
Finally, the computer system performs a stitching operation on the weighted embedded token array and the original embedded token array. The stitching may be performed in different dimensions, such as the channel dimension. Through the stitching operation, the computer system merges the original embedded token array and the weighted embedded token array into a larger array, i.e., the target integrated embedded token array. This array contains both the original embedded characterization information and highlights the more similar parts to the prototype feature parameters of the first target speech object by weighting.
It should be noted that in practical applications, the splicing operation may involve problems such as shape adjustment and dimension matching of the array. The computer system needs to perform corresponding processing according to the specific array shape and the splicing requirement so as to ensure the correctness and the effectiveness of the splicing operation.
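The following numpy sketch, under assumed vector shapes, strings together steps S1431 and S1432: cosine similarity measurement, weighting by the similarity result, and stitching along the channel dimension. The sample vectors are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """S1431: similarity measurement result between one embedded array and the prototype."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def target_integrated_array(embedded_arrays, prototype):
    """Weight each embedded characterization by its similarity to the prototype,
    then concatenate the original and weighted characterizations (S1432)."""
    outputs = []
    for emb in embedded_arrays:
        sim = cosine_similarity(emb, prototype)
        weighted = sim * emb                             # multiplication (weighting)
        outputs.append(np.concatenate([emb, weighted]))  # stitching along channels
    return np.stack(outputs)

embedded_arrays = [np.array([0.2, 0.9, 0.1]),            # hypothetical arrays A, B, C
                   np.array([0.7, 0.1, 0.3]),
                   np.array([0.4, 0.4, 0.4])]
prototype = np.array([0.3, 0.8, 0.2])                    # first target voice object prototype

print(target_integrated_array(embedded_arrays, prototype).shape)   # (3, 6)
```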
Step S144: and carrying out target voice analysis according to the target integration embedded representation array to obtain a target voice mark set corresponding to the first selected preset voice object.
Specifically, for each target integrated embedded characterization array, target voice analysis may be performed based on that array to obtain the emotion and mental state marks (other marks may of course be used in other embodiments) of the first selected preset voice object corresponding to the array. That is, for each first selected preset voice object, the support coefficients corresponding to the emotion mark and the mental state mark are predicted at each position of the corresponding voice (for example, using a softmax classifier or a fully connected network). The marking results of all first selected preset voice objects then form the target voice mark set.
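A hedged sketch of such a per-position prediction head, assuming a feature dimension of 32 and the two mark classes mentioned above; it is a fully connected layer followed by softmax, not the embodiment's exact classifier.

```python
import torch
import torch.nn as nn

NUM_MARK_CLASSES = 2   # assumed: emotion mark, mental state mark
FEATURE_DIM = 32       # assumed channel dimension of the target integrated array

# Fully connected layer + softmax gives a support coefficient per mark at each position.
mark_head = nn.Sequential(
    nn.Linear(FEATURE_DIM, NUM_MARK_CLASSES),
    nn.Softmax(dim=-1),
)

target_integrated = torch.randn(10, FEATURE_DIM)   # 10 positions of one selected voice object
mark_support = mark_head(target_integrated)        # shape: (10, 2); each row sums to 1
print(mark_support.sum(dim=-1))                    # all ones
```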
Step S150: and formulating a target interaction control strategy according to the target voice mark set so as to control the learning type electric toy according to the target interaction control strategy.
In step S150, the computer system formulates a target interaction control strategy based on the aforementioned highly structured set of target voice tags. This step involves converting the complex information contained in the markers into specific control commands that can direct the learning electric toy to respond.
For example, as illustrated in the foregoing step S140, assume that the identified marks are "emotion-haha (indicating happiness)" and "mental state-dash (indicating excitement)". These marks not only characterize the emotion and mental state the user is experiencing, but also provide clues for how to control the learning electric toy. The computer system considers the specific meaning and context of these marks when formulating the target interaction control strategy. For the mark "emotion-haha (indicating happiness)", the control strategy may have the toy perform actions that enhance the user's pleasure, such as playing cheerful music, performing a short dance, or showing a cute expression. For the mark "mental state-dash (indicating excitement)", the control strategy may have the toy display more vigor and energy, such as accelerating its movement, flashing its lights, or making exciting sounds.
To implement this conversion process, the computer system may rely on predefined rules or machine learning models. The predefined rule may be some simple mapping, such as mapping "ha" to an instruction to play music. While the machine learning model can learn complex relationships between the markers and the control commands through training. Such a model may continuously optimize its predictive power based on historical data and user feedback. Finally, the formulated target interaction control strategy will enable the learning electric toy to respond to the user's voice instructions in a more intelligent and personalized manner. This not only promotes the interactivity of the toy, but also enhances the user experience.
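As an illustration of the predefined-rule option, a minimal sketch with hypothetical mark strings and command names might look as follows; a real system could replace this table with a learned model.

```python
# Hypothetical mapping from recognized voice marks to toy control commands (step S150).
RULES = {
    "emotion-haha":      ["play_happy_music", "perform_short_dance", "show_cute_expression"],
    "mental_state-dash": ["accelerate_movement", "flash_lights", "play_exciting_sound"],
}

def make_control_strategy(target_voice_marks):
    """Collect the control commands triggered by every recognized mark."""
    commands = []
    for mark in target_voice_marks:
        commands.extend(RULES.get(mark, []))   # unknown marks trigger nothing
    return commands

print(make_control_strategy(["emotion-haha", "mental_state-dash"]))
```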
In summary, the embodiment of the application determines the core voice characterization array with semantic characterization in the voice characterization array according to the behavior characterization array, performs integration according to the behavior characterization array and the core voice characterization array to obtain the integrated characterization array, and embeds the integrated characterization array into a preset feature domain to obtain the integrated embedded characterization array. In this way, information such as emotion and mental state characterized in the voice data is aligned with the behavior characterization array carrying the same information to complete feature integration. When target voice analysis is performed according to the integrated embedded characterization array and voice marking is performed on the behavior interaction data to be analyzed according to the target voice mark set, recognition reliability is increased, thereby providing better interactive control and bringing a better interaction experience to users.
In a possible implementation manner, the method provided by the embodiment of the present application is implemented according to a voice markup network, and then the method provided by the present application further includes a step of debugging the voice markup network, which may specifically include the following steps: step S10: obtaining a debugging learning sample in a debugging learning sample library, wherein the debugging learning sample comprises behavior interaction sample data and corresponding voice interaction sample data, and the behavior interaction sample data corresponds to a comparison voice mark set.
Step S10 obtains a debugging learning sample from a debugging learning sample library. These samples are the basis for constructing and training the voice markup network. The debugging learning sample library is a collection storing a large amount of behavior interaction sample data and corresponding voice interaction sample data. Each sample contains a description of a behavioral interaction scenario and the voice data generated in that scenario. These samples are used to train a machine learning model so that it can accurately recognize and understand different voice marks.
The meaning of the behavioral interaction sample data and the voice interaction sample data may refer to the description about the behavioral interaction data and the voice interaction data in step S110, which is not described herein. The reference voice tag set is a correct structured description of voice interaction sample data, existing as a priori tags.
Step S20: and loading the debugging learning sample into an embedding mapping component of the initial voice mark network to obtain an output sample integration embedding characterization array.
The embedding mapping component is used for: determining a sample behavior characterization array corresponding to the behavior interaction sample data according to a behavior coding module that has been debugged in advance; determining a sample voice characterization array corresponding to the voice interaction sample data according to a voice processing network that has been debugged in advance; determining a sample core voice characterization array in the sample voice characterization array according to the sample behavior characterization array; performing feature interaction according to the sample behavior characterization array and the sample core voice characterization array to obtain a sample integration characterization array; and performing an embedding operation on the sample integration characterization array according to an integrated neural network that has been debugged in advance to obtain a sample integration embedded characterization array.
In step S20, the computer system loads the debug learning samples into an embedded mapping component of the initial voice markup network, which is essentially a feature encoder responsible for converting the input raw data into a mathematical form that is easier to process and analyze.
Specifically, the embedding mapping component first receives the debugging learning samples, which include behavior interaction sample data and corresponding voice interaction sample data. The behavior interaction sample data reflects the behaviors of the user or entity, and the voice interaction sample data is the voice representation corresponding to these behaviors. The task of the embedding mapping component is to transform these raw data into a mathematical form called an embedded characterization.
The workflow of the embedding mapping component can be divided into several key steps. First, it uses the behavior coding module that has been debugged (i.e., pre-trained) in advance to process the behavior interaction sample data. The function of this module is to convert the behavior data into a mathematical form called the sample behavior characterization array. This array captures the key features in the behavior data so that the computer system can more easily understand and analyze these behaviors.
Next, the embedding mapping component processes the voice interaction sample data using the previously debugged voice processing network. The function of this network is to convert the voice data into a mathematical form called the sample voice characterization array. This array likewise captures key features in the voice data, such as timbre, pitch and speaking rate, which are critical for recognizing and understanding the user's voice marks. After obtaining the sample behavior characterization array and the sample voice characterization array, the embedding mapping component further processes the data. It determines the core part of the sample voice characterization array, namely the sample core voice characterization array, according to the sample behavior characterization array. This process can be understood as extracting from the voice data the part most relevant to the voice marks.
Finally, the embedding mapping component combines the sample behavior characterization array and the sample core voice characterization array through feature interaction to form a more comprehensive characterization, called the sample integration characterization array. This array fuses the key information of the behavior data and the voice data, and provides powerful support for subsequent model training and inference.
In the voice markup network, the behavior coding module, the voice processing network and the integrated neural network are all key machine learning model components. The behavior coding module may be a convolutional neural network (Convolutional Neural Network, CNN), for example, where the behavior interaction sample data is a series of image frames that capture the user's gestures or body actions. CNNs are able to extract hierarchical feature representations from the raw pixels. Through the combination of multiple convolution layers, pooling layers and activation functions, the CNN can learn the key features of gestures or body actions and encode these features as the sample behavior characterization array.
The speech processing network may be a recurrent neural network (Recurrent Neural Network, RNN) or a variant thereof, such as a Long Short-Term Memory (LSTM), because of the time-sequential nature of the speech data, i.e. the dependency between the current sound sample and the preceding and following sound samples. RNNs and their variants LSTM are able to capture this timing dependency and effectively handle variable length speech sequences. By stacking multiple RNN or LSTM layers in combination with appropriate input features (e.g., mel-frequency cepstral coefficients MFCCs), the speech processing network can extract key features such as phonemes, syllables, and semantics in the speech and encode these features into sample speech characterization arrays.
The integrated neural network may be a deep neural network (Deep Neural Network, DNN) or an autoencoder (Autoencoder). The integrated neural network is responsible for fusing the behavior and voice characterizations together to form a more comprehensive characterization. The DNN can learn the nonlinear relationship between the behavior characterization and the voice characterization through a multi-layer fully connected neural network and output an embedded characterization that fuses the information of both. The autoencoder is an unsupervised learning model that can learn an efficient representation of the data by minimizing the difference between the input and the reconstructed output. In the integrated neural network, the encoder part of the autoencoder may be used to compress the behavior and voice characterizations into a low-dimensional embedded characterization, while the decoder part may be used to recover the original behavior or voice information from the embedded characterization (although in the voice markup task, the decoder part may not be necessary).
It should be noted that the machine learning model described above is only one of the possible implementations. Indeed, other suitable machine learning models or algorithms may also be selected to implement the behavior encoding module, the speech processing network, and the integrated neural network, depending on the particular application scenario and data characteristics. In addition, these models may also be combined or cascaded to further improve the performance of the voice markup.
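To make the above concrete, here is a hedged PyTorch sketch of one possible arrangement of the three components (a CNN behavior encoder, an LSTM speech encoder over MFCC frames, and a DNN fusion network). All layer sizes, input shapes and module names are assumptions for illustration, not the patented architecture.

```python
import torch
import torch.nn as nn

class BehaviorEncoder(nn.Module):                      # behavior coding module (CNN)
    def __init__(self, out_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, out_dim)

    def forward(self, frames):                         # frames: (batch, 3, H, W)
        return self.fc(self.conv(frames).flatten(1))   # behavior characterization array

class SpeechEncoder(nn.Module):                        # voice processing network (LSTM)
    def __init__(self, in_dim=13, out_dim=64):         # e.g. 13 MFCC coefficients per frame
        super().__init__()
        self.lstm = nn.LSTM(in_dim, out_dim, batch_first=True)

    def forward(self, mfcc):                           # mfcc: (batch, time, 13)
        outputs, _ = self.lstm(mfcc)
        return outputs[:, -1]                          # voice characterization array

class IntegrationNetwork(nn.Module):                   # integrated neural network (DNN)
    def __init__(self, in_dim=128, embed_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, embed_dim))

    def forward(self, behavior_repr, speech_repr):
        fused = torch.cat([behavior_repr, speech_repr], dim=-1)
        return self.mlp(fused)                         # integrated embedded characterization array

behavior = BehaviorEncoder()(torch.randn(2, 3, 64, 64))
speech = SpeechEncoder()(torch.randn(2, 50, 13))
embedded = IntegrationNetwork()(behavior, speech)
print(embedded.shape)                                  # torch.Size([2, 64])
```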
Step S30: and loading the sample integration embedded representation array to a restoration mapping component of the initial voice mark network to perform target voice analysis to obtain the reasoning confidence levels of a plurality of reasoning voice mark sets, wherein the plurality of reasoning voice mark sets comprise comparison voice mark sets.
In step S30, the restoration mapping component, i.e., a decoder, restores the embedded characterizations to the original data or extracts key information from them. Specifically, when the computer system loads the sample integration embedded characterization array into the restoration mapping component, the component parses the embedded characterizations using its own structure and parameters. This process can be understood as the decoder attempting to "interpret" or "translate" these embedded characterizations in order to extract information about the target voice. In the voice markup network, the target voice refers to the voice interaction data corresponding to the behavioral interaction sample data. The goal of the restoration mapping component is to parse out the key features or marks of the target voice, which are typically represented in the form of a voice mark set.
In step S30, the restoration mapping component processes the sample integration embedded token array and outputs the inference confidence levels of the plurality of inference voice tag sets. These inferred voice tag sets are the possible target voice tag sets inferred based on the embedded token, including the reference voice tag set (i.e., voice tags corresponding to the original behavioral interaction sample data) as well as other possible tag sets.
The inference confidence level is an indicator that measures the confidence level of each inferred voice token set. It is typically expressed in the form of a probability value or confidence score reflecting the degree of confidence of the restoration mapping component in each of the inference results. The higher the confidence level, the more closely the restoration mapping component considers the reasoning results to be the true set of target phonetic markers. The specific implementation of the restore map component may vary from application scenario to application scenario and from technical solution to technical solution. It may be a rule-based method, a traditional machine learning model (such as a support vector machine, a decision tree, etc.), or a deep learning model (such as a recurrent neural network, a convolutional neural network, etc.). In selecting an appropriate model, factors such as the characteristics of the data, the complexity of the task, and the computing resources need to be considered.
Step S40: an error value is determined based on the inference confidence levels of the plurality of inference voice tag sets.
In step S40, the computer system determines an error value, also referred to as a penalty, based on the inferred confidence levels of the plurality of inferred voice token sets obtained in step S30. The error value measures the difference between the current predicted result and the actual result of the model and is a key basis for optimizing the model and adjusting parameters subsequently. In particular, the inference confidence level reflects the degree of confidence that the model has in each set of inference phonetic markers. In general, the higher the confidence level, the greater the confidence that the model holds for the reasoning result, the more accurate the prediction; conversely, larger errors may exist. In step S40, the computer calculates an overall error value by taking into account the confidence levels of all the inferred voice token sets, and their differences from the reference voice token set (i.e., the true token).
The calculation of this error value can take many forms, depending on the model type and application scenario. For example, in classification tasks, a common error calculation method is a cross entropy loss function, which measures the difference between the probability distribution of model predictions and the true probability distribution. In the regression task, the difference between the model predicted value and the true value may be measured by mean square error or mean absolute error.
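As an illustration of the cross-entropy idea mentioned above, the following sketch assumes the inference confidence levels have been normalized into a probability distribution and that the comparison (reference) voice mark set sits at index 0; it is not the embodiment's exact error function.

```python
import numpy as np

def cross_entropy(confidences, true_index):
    """Negative log probability of the true mark set; smaller when the model favors it."""
    probs = np.asarray(confidences, dtype=float)
    probs = probs / probs.sum()                 # normalize to a distribution
    return -np.log(probs[true_index] + 1e-12)

print(cross_entropy([0.7, 0.2, 0.1], true_index=0))   # low error: model favors the true set
print(cross_entropy([0.1, 0.2, 0.7], true_index=0))   # high error: model favors a wrong set
```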
This error value reflects not only the behavior of the model on the current sample, but also represents the average performance of the model over the entire training set. The model is optimized continuously, parameters are adjusted to reduce the error value, the generalization capability of the model can be improved, and the model is better suitable for voice instruction recognition tasks in various actual scenes.
Step S50: and optimizing the learnable parameter values embedded in the mapping component and the restoring mapping component respectively according to the error value until meeting the debugging cut-off requirement, and obtaining the voice marking network.
In step S50, the computer system optimizes the learnable parameters embedded in the mapping component and the restoring mapping component according to the error value calculated in step S40, so as to improve the performance of the model. In particular, the embedding mapping component and the restoration mapping component are two key parts in a voice markup network. The embedding mapping component is responsible for converting the original behavioral interaction sample data and voice interaction data into embedded tokens, and the restoring mapping component is responsible for parsing out key information and labels of the target voice from the embedded tokens. Both components contain some learnable parameters, such as weights, biases, etc., whose values directly affect the performance of the model.
In step S50, the computer uses an optimization algorithm to adjust the learnable parameters according to the error values. The choice of optimization algorithm may be determined according to the specific application scenario and model type. Common optimization algorithms include gradient descent, random gradient descent, adam, etc. The algorithm updates the learnable parameter according to the gradient information calculated by the error value, so that the predicted result of the model is more similar to the real result, and the error value is reduced.
This process continues until the performance of the model reaches the preset debug cutoff requirement. The debugging cut-off requirement can be determined according to specific application scenes and requirements, and can be that an error value is smaller than a certain threshold value, or that the performance of the model on a verification set reaches a certain index or the like. When the model meets the debugging cut-off requirement, a final voice marking network can be obtained.
In the above step S50, optimizing the learnable parameter values of the embedding mapping component and the restoration mapping component respectively according to the error value may specifically include:
Step S51: determining a first learning rate of a first learnable parameter according to the error value, wherein the first learnable parameter is a learnable parameter in a voice processing network which is debugged in advance;
Step S52: determining a second learning rate of the second learnable parameter according to the error value, wherein the second learning rate is greater than the first learning rate; the second learnable parameter includes a learnable parameter embedded in the mapping component other than the first learnable parameter and a learnable parameter restored in the mapping component;
Step S53: and optimizing the corresponding learnable parameter according to the first learning rate and the second learning rate respectively.
In step S51, the computer system determines a learning rate of a learnable parameter (i.e., a first learnable parameter) in the pre-commissioned speech processing network according to the error value calculated in step S40. The learning rate is a key parameter in the optimization algorithm, which determines the step size of the parameter at each iteration update. A smaller learning rate may lead to a slow optimization process, while a larger learning rate may lead to an unstable optimization process. The determination of the first learning rate is typically based on the magnitude and trend of the error value. If the error value is large or the descent speed is slow, the computer system may select a smaller first learning rate to ensure the stability of the optimization process; conversely, if the error value is small or the descent speed is fast, a larger first learning rate may be selected to accelerate the optimization process.
Taking the speech processing network as an example, assume that the network is a recurrent neural network (RNN) whose learnable parameters include weights and biases. In the initial stage, because the error value is large, the computer system may set a smaller first learning rate, e.g., 0.001, for these parameters. As the optimization process proceeds and the error value decreases, the computer system may gradually increase the first learning rate to accelerate the convergence of the network.
In step S52, the computer system determines a second learning rate for the learnable parameters embedded in the map component other than the first learnable parameter and for the learnable parameters in the restore map component (i.e., the second learnable parameter). Unlike the first learning rate, the second learning rate is typically set to a larger value in order to adjust the values of the parameters faster during the optimization process. The determination of the second learning rate is likewise based on the magnitude and trend of the error value, but it is considered that these parameters may be located deeper in the network or directly related to the output layer, and thus their sensitivity to the error value may be higher. By setting a larger second learning rate, the computer system can adjust the values of these parameters faster to reduce the error value and improve the performance of the model.
Taking the embedded mapping component as an example, it is assumed to be a Deep Neural Network (DNN) that can learn weights and biases for parameters including multiple hidden layers. Because these parameters have a greater impact on the performance of the model, the computer system may set a greater second learning rate for them, such as 0.01 or higher. In this way, the values of these parameters can be adjusted faster during the optimization process, thereby improving the performance of the model.
In step S53, the computer system optimizes the corresponding learnable parameter according to the first learning rate and the second learning rate determined in steps S51 and S52, respectively. The optimization process is typically implemented using a gradient descent method or a variation thereof (e.g., random gradient descent method, adam, etc.).
Specifically, for each learnable parameter, the computer system calculates its gradient with respect to the error value (i.e., the extent to which the parameter affects the error value), and then updates the value of the parameter according to the learning rate. The magnitude of the learning rate determines the step size of parameter update: a larger learning rate may result in a rapid change in the parameter values during the optimization process, while a smaller learning rate may result in a slower but more stable change.
By continually iterating the optimization process (i.e., performing step S50 multiple times), the computer system may gradually adjust the values of the learnable parameters embedded in the mapping component and the restoration mapping component to reduce the error value and improve the performance of the voice-markup network. When the performance of the network reaches the preset debugging cut-off requirement (for example, the error value is smaller than a certain threshold value or the performance index on the verification set reaches the requirement), the final voice mark network model can be obtained.
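One common way to realize two learning rates in practice is optimizer parameter groups. The following PyTorch sketch uses stand-in modules and the example values 0.001 and 0.01; it illustrates the grouping idea only and is not the patented training procedure.

```python
import torch

# Stand-in modules for the three parameter sets discussed in steps S51-S52.
speech_net = torch.nn.LSTM(13, 64, batch_first=True)        # first learnable parameters
other_embedding = torch.nn.Linear(128, 64)                  # remaining embedding parameters
restoration = torch.nn.Linear(64, 10)                       # restoration mapping parameters

optimizer = torch.optim.Adam([
    {"params": speech_net.parameters(), "lr": 0.001},       # smaller first learning rate
    {"params": list(other_embedding.parameters())
             + list(restoration.parameters()), "lr": 0.01}, # larger second learning rate
])

# After loss.backward(), optimizer.step() updates each group with its own step size (S53).
```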
In the step S30, loading the sample integration embedded token array to the restoration mapping component of the initial voice markup network to perform target voice analysis, so as to obtain the inference confidence levels of the multiple inference voice markup sets, which may specifically include: step S31: loading the sample integration embedded characterization array to a restoration mapping component for voice object range reasoning to obtain a voice object range reasoning result; the speech object scope reasoning result comprises a second selected preset speech object of the plurality of preset speech objects.
In step S31, the computer system loads the sample integration embedded token array into the restore map component. The restoration mapping component is responsible for parsing these embedded tokens to determine the range of speech objects contained in the user's speech instructions. The purpose of the voice object range reasoning is to identify the voice objects actually present in the user's voice instructions, which are relevant to the control and operation of the electric toy. For example, in an electronic toy scenario, the preset phonetic objects may include "ha", "stop", "dash", "i don't play", "right turn", and so forth. Through voice object range reasoning, the restoration mapping component analyzes the elements embedded in the sample integration characterization array, and compares and matches the elements with a preset voice object. Finally, the reasoning result is output as a second selected preset voice object which is actually present in the user voice command and is related to the control of the electric toy.
Step S32: determining, according to the second selected preset voice object, the corresponding second target voice object prototype feature parameters in the voice object prototype feature parameter set.
Step S33: and carrying out similarity measurement on the sample integration embedded characterization array according to the prototype characteristic parameters of the second target voice object to obtain the target sample integration embedded characterization array corresponding to the second selected preset voice object.
Steps S32 to S33 play a role in the voice markup network, and they convert the result of voice object range reasoning into the target sample integration embedded token array for a specific voice object, thereby laying a foundation for subsequent voice analysis. The following is a detailed explanation of these two steps: in step S32, the computer system searches for a corresponding second target speech object prototype feature parameter in the speech object prototype feature parameter set according to the speech object range inference result output in step S31, i.e. the second selected preset speech object. The prototype feature parameters are predefined and learned for each preset speech object to represent the core features of the speech object.
In step S33, the computer system performs similarity measurement on the elements in the sample integration embedded token array by using the prototype feature parameters of the second target speech object acquired in step S32. This process can be understood as the computer system attempting to find the embedded token that best matches the target speech object. The similarity measure is typically based on some distance measurement algorithm or similarity scoring function, such as euclidean distance, cosine similarity, etc. The computer compares each element (or a small group of elements) in the sample integration embedded token array with the prototype feature parameters of the second target speech object to calculate the similarity between them. Finally, according to the similarity calculation result, the computer system selects the parts most similar to the second target voice object to embed the characterization, and the parts form a target sample integration embedding characterization array. This array is passed as input to the target speech parsing module of the restoration mapping component for further processing and analysis in subsequent steps.
Therefore, the computer system can gradually reduce the attention point, focus on the part closely related to the target voice object from the whole embedded characterization array, and the method based on the prototype characteristic parameters and the similarity measure also enables the system to have certain flexibility and generalization capability, and can meet the voice interaction requirements of different users and different scenes.
Step S34: and loading the target sample integration embedded representation array to a target voice analysis module of the restoration mapping assembly to analyze target voice to obtain the reasoning confidence levels of a plurality of reasoning voice mark sets corresponding to the second selected preset voice object.
In step S34, the computer system takes the target sample integration embedded characterization array obtained in step S33 as input and loads it into the target voice analysis module of the restoration mapping component for processing. The target voice analysis module is typically a complex machine learning model trained to recognize specific patterns and structures in the voice data. Taking a specific implementation as an example, the target voice analysis module may be a model formed by a BiLSTM (bidirectional long short-term memory network), an MLP (multi-layer perceptron) and a CRF (conditional random field) connected in sequence. The BiLSTM is responsible for capturing long-term dependencies in the sequence data, the MLP is used to further extract features, and the CRF is used to label and classify the extracted features. In this process, the target sample integration embedded characterization array is first fed into the BiLSTM layers. Through its internal memory units and gating mechanisms, the BiLSTM can effectively process variable-length sequence data and capture the context information in the sequence, outputting a feature sequence containing richer context information. This feature sequence is then fed into the MLP layer. Through multi-layer nonlinear transformation, the MLP further extracts and abstracts features, making the output features more discriminative and representative; its output is typically a high-dimensional feature vector containing a deep-level feature representation of the input data. Finally, this high-dimensional feature vector is fed into the CRF layer. The CRF is a probabilistic graphical model for sequence labeling and classification that can take into account the dependencies between adjacent labels in a sequence and output the most likely label sequence and its confidence level. In the context of voice parsing, the CRF may output multiple inference voice mark sets, each corresponding to one possible voice parsing result, with a corresponding confidence level. By utilizing machine learning models such as the BiLSTM, MLP and CRF, deep feature extraction and pattern recognition are carried out on the target sample integration embedded characterization array, and finally the key information and marks in the target voice are parsed and the corresponding confidence levels are given.
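A simplified PyTorch sketch of the BiLSTM-plus-MLP part of such a parser follows. For brevity the CRF layer is replaced by a per-position softmax and argmax, so this is an approximation of the described model, not the embodiment itself; the dimensions and tag count are assumptions.

```python
import torch
import torch.nn as nn

class TargetSpeechParser(nn.Module):
    def __init__(self, in_dim=64, hidden=64, num_tags=5):
        super().__init__()
        self.bilstm = nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True)
        self.mlp = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, num_tags))

    def forward(self, target_integrated):                # (batch, time, in_dim)
        context, _ = self.bilstm(target_integrated)      # bidirectional context features
        emissions = self.mlp(context)                    # per-position tag scores
        confidence = emissions.softmax(dim=-1)           # confidence level per tag
        return confidence.argmax(dim=-1), confidence     # tag sequence and confidences

tags, conf = TargetSpeechParser()(torch.randn(1, 20, 64))
print(tags.shape, conf.shape)                            # torch.Size([1, 20]) torch.Size([1, 20, 5])
```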
In the step S31, loading the sample integration embedded token array to the restoration mapping component to perform voice object range reasoning, so as to obtain a voice object range reasoning result, which may specifically include:
Step S311: extracting an embedded characterization array at a position corresponding to a preset classification identification code from the sample integration embedded characterization array to obtain a sample classification embedded characterization array; the preset classification identification code is an identification code placed at the header of the behavior interaction sample data when generating the sample behavior characterization array;
Step S312: loading the sample classification embedded characterization array into a classification module of the restoration mapping component for mapping classification to obtain inference classification information; the inference classification information comprises an inference support coefficient corresponding to each preset voice object in the plurality of preset voice objects, and the inference support coefficient characterizes the confidence level of the corresponding preset voice object in the behavior interaction sample data;
Step S313: and determining a second selected preset voice object with the inference support coefficient larger than the support coefficient threshold value in the plurality of preset voice objects to obtain a voice object range inference result.
Step S311 is for extracting the sample classification embedded characterization array. In this step, the computer system first processes the sample integration embedded characterization array. The array contains the embedded characterization information of the original behavior interaction sample data and voice interaction data, and is a multi-dimensional data structure. The computer system finds the embedded characterization at the corresponding position in the sample integration embedded characterization array according to the preset classification identification code, and extracts that characterization to form a new array, namely the sample classification embedded characterization array. The preset classification identification code is an identification code placed at the header of the behavior interaction sample data when generating the sample behavior characterization array, and is used to identify and locate different types of data or different processing steps. By extracting the embedded characterization corresponding to the preset classification identification code, the computer system can focus on the part of the data most relevant to the current task (voice object range reasoning), thereby improving processing efficiency and accuracy.
Next, the computer system loads the sample classification embedded token array obtained in step S311 into the classification module of the restoration mapping component. This classification module is typically a machine learning model that is trained to learn how to predict the output (here, the inferred classification information) from the input data (here, the sample classification embedded token array). In a specific embodiment, the classification module may adopt a linear mapping method to map the channel dimensions of the sample classification embedded token array to the number of preset speech objects. This means that each channel corresponds to a specific preset speech object. Then, the mapping result is converted into a range from 0 to 1 by using the Sigmoid activation function, so that the confidence that each preset voice object exists in the behavior interaction sample data is obtained. These confidence levels form part of the inferential classification information.
The reasoning and classifying information also comprises reasoning supporting coefficients corresponding to each preset voice object. The inference support coefficient is a quantization index used for representing the confidence level of the corresponding preset voice object in the behavior interaction sample data. It may help the computer system determine which speech objects are most likely to be present in the current interaction scenario.
Finally, in step S313, the computer system determines which of the plurality of preset speech objects are most likely to exist in the current interaction scenario according to the inference support coefficients. In particular, it compares the inferred support coefficients to a support coefficient threshold. If the inferred support coefficient for a certain preset speech object is greater than the support coefficient threshold, then the preset speech object is considered to be the second selected preset speech object and is included in the speech object range inference result. The support coefficient threshold is a preset threshold used for screening out preset voice objects with higher confidence level. By setting a proper support coefficient threshold, the computer system can reduce false alarm and missing report situations while ensuring the identification accuracy.
Through the three sub-steps of extracting the sample classification embedded characterization array, performing mapping classification to obtain the inference classification information, and determining the second selected preset voice object, the goal of identifying the voice object range from the sample integration embedded characterization array is achieved.
Based on this, in the above step S40, an error value is determined according to the inference confidence levels of the plurality of inference voice tag sets, including:
step S41: determining a first error value according to the reasoning confidence levels of a plurality of reasoning voice mark sets corresponding to the second selected preset voice object based on the first error determining function;
step S42: based on a second error determining function, determining a second error value according to the reasoning classification information and the comparison classification priori information corresponding to the behavior interaction sample data, wherein the comparison classification priori information is used for indicating whether each preset voice object in the plurality of preset voice objects exists in the behavior interaction sample data;
Step S43: and determining a total error value according to the first error value and the second error value.
In step S41, the computer system calculates a first error value of the inference confidence level of the plurality of sets of inference phonetic markers corresponding to the second selected preset phonetic object using the first error determination function. The inference confidence level reflects the confidence level of the model for each inferential voice token, typically a probability value between 0 and 1. The first error determination function may be a common error calculation function such as a mean square error (Mean Squared Error, MSE) or Cross entropy loss (Cross-Entropy Loss). Taking the mean square error as an example, it calculates the average of the squares of the differences between the inference confidence level and the real labels. If the inference confidence level differs more from the true tag, the first error value will be higher and vice versa.
Next, the computer system calculates a second error value between the inferential classification information and the collation classification prior information corresponding to the behavioral interaction sample data using a second error determination function. The collation classification a priori information is information indicating whether each of a plurality of predetermined speech objects is present in the behavioral interaction sample data, which is typically provided by a human annotation or other reliable source. The second error determination function may be an index of evaluating classification performance such as Accuracy (Accuracy), precision (Precision), recall (Recall), or F1 score. Taking the accuracy as an example, it calculates the ratio between the number of correctly classified samples and the total number of samples in the inference classification information. If the inferred classification information differs more from the comparison classification a priori information, the second error value will be higher and vice versa.
Finally, in step S43, the computer system determines a total error value according to the first error value and the second error value. The total error value is a comprehensive assessment of the accuracy of the inference result and may be calculated by weighted summing the first error value and the second error value or other suitable combination.
Weighted summation is a common combination that may assign different weights depending on the importance or confidence of each error value. For example, if the error in the inference confidence level is considered more important for model optimization, then a higher weight may be assigned to the first error value; conversely, a higher weight may be assigned to the second error value. By adjusting the weights, the total error value can more accurately reflect the accuracy of the reasoning result.
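A minimal sketch of the weighted combination in step S43 follows; the weights 0.7 and 0.3 are chosen purely for illustration and would be tuned in practice.

```python
def total_error(first_error, second_error, w1=0.7, w2=0.3):
    """Weighted sum of the two error terms; weights reflect their relative importance."""
    return w1 * first_error + w2 * second_error

print(total_error(first_error=0.42, second_error=0.15))   # 0.339
```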
By calculating the first error value, the second error value and the total error value, the accuracy of the inference result is evaluated, which provides guidance for optimizing the model.
In one embodiment, step S41 above, namely determining the first error value based on the first error determining function according to the inference confidence levels of the plurality of inference voice mark sets corresponding to the second selected preset voice object, may specifically include:
Step S411: when the second selected preset voice object matches the comparison classification prior information, taking maximization of the inference confidence level corresponding to the comparison voice mark set among the plurality of inference voice mark sets as a first direction, and obtaining the first error value based on the first error determining function according to the first direction and the inference confidence levels of the plurality of inference voice mark sets;
Step S412: when the second selected preset voice object does not match the comparison classification prior information, taking maximization of the inference confidence level corresponding to the target inference voice mark set among the plurality of inference voice mark sets as a second direction, and obtaining the first error value based on the first error determining function according to the second direction and the inference confidence levels of the plurality of inference voice mark sets, wherein each voice object mark in the target inference voice mark set represents none of the preset voice objects.
The second selected preset voice object is a preset voice object whose inference support coefficient is greater than the support coefficient threshold among the plurality of preset voice objects; in other words, it is a voice object that the model infers, with high confidence, to be present in the behavior interaction sample data. The comparison classification prior information indicates which preset voice objects are actually present in the behavior interaction sample data. Therefore, if the preset voice objects indicated by the comparison classification prior information include the second selected preset voice object, the second selected preset voice object is considered to match the comparison classification prior information; otherwise, it is considered not to match.
When the second selected preset voice object matches the comparison classification prior information, the voice object is present in the real behavior interaction sample data. In this case, the computer system sets the goal of maximizing the inference confidence level of the comparison voice mark set, i.e., the set among the plurality of inference voice mark sets that corresponds to this voice object. This inference confidence level reflects how confident the model is that the voice object is present.
To achieve this goal, the computer may calculate the first error value using the first error determining function. The function computes an error from the inference confidence levels and the set target (i.e., maximizing the inference confidence level of the comparison voice mark set). In general, the higher the inference confidence level and the closer it is to the real situation, the smaller the error value; conversely, if the inference confidence level is low or deviates strongly from the real situation, the error value will be large.
When the second selected preset voice object does not match the comparison classification prior information, the voice object is not present in the real behavior interaction sample data. In this case, the computer system sets another goal: maximizing the inference confidence level of the target inference voice mark set, i.e., the set among the plurality of inference voice mark sets whose marks all represent none of the preset voice objects. This in effect encourages the model to judge more confidently that the voice object is not present.
Likewise, the computer may use the first error determining function to calculate the first error value in this case. The function computes the error from the inference confidence levels and the set target (i.e., maximizing the inference confidence level of the target inference voice mark set). If the model wrongly believes the voice object is present, or is not confident enough in judging that it is absent, the error value will be large.
In both cases, the specific form of the first error determining function may be a common error calculation function such as mean squared error or cross-entropy loss; the choice depends on the requirements of the task and the characteristics of the model. By calculating the first error value, the computer system can quantify the model's performance on the inference voice mark sets and thereby provide guidance for subsequent model optimization.
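Putting steps S411 and S412 together, the following is a hedged sketch of how the optimisation direction could be selected; it mirrors the cross-entropy variant shown earlier, and the indices, names and normalisation are assumptions for illustration rather than the application's prescribed method:

```python
import numpy as np

def directed_first_error(confidences, comparison_index, none_index, matches_prior):
    """Chooses the optimisation direction of steps S411/S412 (illustrative).

    matches_prior:    True when the second selected preset voice object appears
                      in the comparison classification prior information.
    comparison_index: index of the comparison voice mark set.
    none_index:       index of the target inference voice mark set whose marks
                      all represent none of the preset voice objects.
    """
    confidences = np.asarray(confidences, dtype=float)
    # First direction: drive the comparison set's confidence towards the maximum.
    # Second direction: drive the "none of the objects" set's confidence instead.
    target_index = comparison_index if matches_prior else none_index
    probs = confidences / (confidences.sum() + 1e-12)
    return float(-np.log(probs[target_index] + 1e-12))  # cross-entropy style error

# Example: the last candidate set is the "none of the objects" set.
print(directed_first_error([0.6, 0.3, 0.1], 0, 2, matches_prior=True))   # ~0.511
print(directed_first_error([0.6, 0.3, 0.1], 0, 2, matches_prior=False))  # ~2.303
```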
The embodiment of the present application also provides a computer system (for example, the learning electric toy itself or a background device in communication with the learning electric toy). As shown in fig. 2, the computer system 100 includes a processor 101 and a memory 103, wherein the processor 101 is coupled to the memory 103, for example via a bus 102. Optionally, the computer system 100 may also include a transceiver 104. It should be noted that, in practical applications, the transceiver 104 is not limited to one, and the structure of the computer system 100 does not constitute a limitation on the embodiments of the present application.
The processor 101 may be a CPU, a general-purpose processor, a GPU, a DSP, an ASIC, an FPGA or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or perform the various exemplary logic blocks, modules and circuits described in connection with this disclosure. The processor 101 may also be a combination that implements computing functionality, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
Bus 102 may include a path for transferring information between the aforementioned components. Bus 102 may be a PCI bus, an EISA bus, or the like, and may be classified as an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 2, but this does not mean there is only one bus or only one type of bus.
Memory 103 may be, but is not limited to, a ROM or other type of static storage device that can store static information and instructions, a RAM or other type of dynamic storage device that can store information and instructions, an EEPROM, a CD-ROM or other optical disk storage, optical disk storage (including compact disks, laser disks, optical disks, digital versatile disks, blu-ray disks, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
The memory 103 is used for storing application program code for executing the solution of the present application, and its execution is controlled by the processor 101. The processor 101 is configured to execute the application program code stored in the memory 103 to implement the content shown in any of the method embodiments described above.
The embodiment of the application provides a computer system, which comprises: one or more processors; a memory; and one or more computer programs, wherein the one or more computer programs are stored in the memory and configured to be executed by the one or more processors, and when executed by the one or more processors, the one or more computer programs implement the methods described above.
Embodiments of the present application provide a computer readable storage medium having a computer program stored thereon which, when run on a processor, enables the processor to perform the corresponding content of the method embodiments described above.
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the execution order of the steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include a plurality of sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and their execution order is not necessarily sequential; they may be performed in turn or alternately with other steps, or with at least a portion of the sub-steps or stages of other steps.
The foregoing is only a partial embodiment of the present application. It should be noted that those skilled in the art can make modifications and adaptations without departing from the principles of the present application, and such modifications and adaptations are also intended to fall within the scope of protection of the present application.

Claims (10)

1. A learning type electric toy control method based on man-machine interaction, characterized by comprising the following steps: acquiring a behavior characterization array of behavior interaction data to be analyzed and a voice characterization array of voice interaction data to be analyzed, wherein the behavior interaction data to be analyzed is matching behavior data corresponding to the voice interaction data to be analyzed; determining a core voice characterization array in the voice characterization array according to the behavior characterization array, wherein the core voice characterization array is a voice characterization array whose semantic characterization is involved in the behavior characterization array; performing feature interaction according to the behavior characterization array and the core voice characterization array to obtain an integrated characterization array, and embedding the integrated characterization array into a preset feature domain to obtain an integrated embedded characterization array; performing target voice analysis according to the integrated embedded characterization array to obtain a target voice mark set; and formulating a target interaction control strategy according to the target voice mark set, so as to control the learning type electric toy according to the target interaction control strategy.
2. The method of claim 1, wherein said determining a core voice characterization array in the voice characterization array according to the behavior characterization array comprises: determining a cross-attention influence coefficient according to the behavior characterization array and the voice characterization array; and correcting the voice characterization array according to the cross-attention influence coefficient to obtain the core voice characterization array; and wherein said performing feature interaction according to the behavior characterization array and the core voice characterization array to obtain an integrated characterization array comprises: determining an internal attention influence coefficient according to the behavior characterization array; correcting the behavior characterization array according to the internal attention influence coefficient to obtain a core behavior characterization array; and performing feature interaction on the behavior characterization array, the core voice characterization array and the core behavior characterization array to obtain the integrated characterization array.
3. The method of claim 1, wherein the performing target voice analysis according to the integrated embedded characterization array to obtain a target voice mark set comprises: determining a first selected preset voice object from a plurality of preset voice objects according to the integrated embedded characterization array, wherein the first selected preset voice object is a preset voice object present in the behavior interaction data to be analyzed; determining a first target voice object prototype feature parameter corresponding to the first selected preset voice object in a voice object prototype feature parameter set, wherein the voice object prototype feature parameter set comprises a voice object prototype feature parameter of each of the plurality of preset voice objects; performing similarity measurement on the integrated embedded characterization array according to the first target voice object prototype feature parameter to obtain a target integrated embedded characterization array corresponding to the first selected preset voice object; and performing target voice analysis according to the target integrated embedded characterization array to obtain a target voice mark set corresponding to the first selected preset voice object.
4. The method of claim 3, wherein determining a first selected preset voice object from a plurality of preset voice objects according to the integrated embedded characterization array comprises: extracting an embedded characterization array at a position corresponding to a preset classification identification code from the integrated embedded characterization array to obtain a classification embedded characterization array, wherein the preset classification identification code is an identification code arranged at the head of the behavior interaction data to be analyzed when the behavior characterization array is generated; performing mapping classification according to the classification embedded characterization array to obtain classification information, wherein the classification information comprises a support coefficient corresponding to each of the plurality of preset voice objects, and the support coefficient characterizes the confidence level that the corresponding preset voice object is present in the behavior interaction data to be analyzed; and determining, among the plurality of preset voice objects, a first selected preset voice object whose support coefficient is greater than a support coefficient threshold; and wherein the performing similarity measurement on the integrated embedded characterization array according to the first target voice object prototype feature parameter to obtain a target integrated embedded characterization array corresponding to the first selected preset voice object comprises: determining a similarity measurement result between each embedded characterization array in the integrated embedded characterization array and the first target voice object prototype feature parameter; and multiplying the similarity measurement result by the embedded characterization array, and splicing the multiplication result with the embedded characterization array to obtain the target integrated embedded characterization array.
5. The method according to claim 3 or 4, wherein the method is performed according to a voice marking network, and the method further comprises a debugging step of the voice marking network, comprising: obtaining a debugging learning sample in a debugging learning sample library, wherein the debugging learning sample comprises behavior interaction sample data and corresponding voice interaction sample data, and the behavior interaction sample data corresponds to a comparison voice mark set; loading the debugging learning sample into an embedded mapping component of an initial voice marking network to obtain an output sample integration embedded characterization array, wherein the embedded mapping component is used for determining a sample behavior characterization array corresponding to the behavior interaction sample data according to a pre-debugged behavior coding module, determining a sample voice characterization array corresponding to the voice interaction sample data according to a pre-debugged voice processing network, determining a sample core voice characterization array in the sample voice characterization array according to the sample behavior characterization array, performing feature interaction according to the sample behavior characterization array and the sample core voice characterization array to obtain a sample integration characterization array, and performing an embedding operation on the sample integration characterization array according to a pre-debugged integrated neural network to obtain the sample integration embedded characterization array; loading the sample integration embedded characterization array into a restoration mapping component of the initial voice marking network to perform target voice analysis, to obtain inference confidence levels of a plurality of inference voice mark sets, wherein the plurality of inference voice mark sets comprise the comparison voice mark set; determining an error value according to the inference confidence levels of the plurality of inference voice mark sets; and respectively optimizing learnable parameter values in the embedded mapping component and the restoration mapping component according to the error value until the learnable parameter values meet a debugging cut-off requirement, thereby obtaining the voice marking network.
6. The method of claim 5, wherein respectively optimizing the learnable parameter values in the embedded mapping component and the restoration mapping component according to the error value comprises: determining a first learning rate of a first learnable parameter according to the error value, wherein the first learnable parameter is a learnable parameter in the pre-debugged voice processing network; determining a second learning rate of a second learnable parameter according to the error value, wherein the second learning rate is greater than the first learning rate, and the second learnable parameter comprises a learnable parameter in the embedded mapping component other than the first learnable parameter and a learnable parameter in the restoration mapping component; and respectively optimizing corresponding learnable parameter values according to the first learning rate and the second learning rate.
7. The method of claim 5, wherein loading the sample integration embedded characterization array into a restoration mapping component of the initial voice marking network to perform target voice analysis to obtain inference confidence levels of a plurality of inference voice mark sets comprises: loading the sample integration embedded characterization array into the restoration mapping component to perform voice object range inference to obtain a voice object range inference result, wherein the voice object range inference result comprises a second selected preset voice object among the plurality of preset voice objects; determining a second target voice object prototype feature parameter corresponding to the second selected preset voice object in the voice object prototype feature parameter set; performing similarity measurement on the sample integration embedded characterization array according to the second target voice object prototype feature parameter to obtain a target sample integration embedded characterization array corresponding to the second selected preset voice object; and loading the target sample integration embedded characterization array into a target voice analysis module of the restoration mapping component to perform target voice analysis to obtain the inference confidence levels of the plurality of inference voice mark sets corresponding to the second selected preset voice object.
8. The method of claim 7, wherein loading the sample integration embedded characterization array into the restoration mapping component to perform voice object range inference to obtain the voice object range inference result comprises: extracting an embedded characterization array at a position corresponding to a preset classification identification code from the sample integration embedded characterization array to obtain a sample classification embedded characterization array, wherein the preset classification identification code is an identification code arranged at the head of the behavior interaction sample data when the sample behavior characterization array is generated; loading the sample classification embedded characterization array into a classification module of the restoration mapping component to perform mapping classification to obtain inference classification information, wherein the inference classification information comprises an inference support coefficient corresponding to each of the plurality of preset voice objects, and the inference support coefficient characterizes the confidence level that the corresponding preset voice object is present in the behavior interaction sample data; and determining, among the plurality of preset voice objects, a second selected preset voice object whose inference support coefficient is greater than a support coefficient threshold, to obtain the voice object range inference result.
9. The method of claim 8, wherein said determining an error value according to the inference confidence levels of the plurality of inference voice mark sets comprises: determining a first error value based on a first error determining function according to the inference confidence levels of the plurality of inference voice mark sets corresponding to the second selected preset voice object; determining a second error value based on a second error determining function according to the inference classification information and comparison classification prior information corresponding to the behavior interaction sample data, wherein the comparison classification prior information is used for indicating whether each of the plurality of preset voice objects is present in the behavior interaction sample data; and determining a total error value according to the first error value and the second error value.
10. The method of claim 9, wherein determining the first error value based on the first error determining function according to the inference confidence levels of the plurality of inference voice mark sets corresponding to the second selected preset voice object comprises: when the second selected preset voice object matches the comparison classification prior information, taking maximization of the inference confidence level corresponding to the comparison voice mark set among the plurality of inference voice mark sets as a first direction, and obtaining the first error value based on the first error determining function according to the first direction and the inference confidence levels of the plurality of inference voice mark sets; and when the second selected preset voice object does not match the comparison classification prior information, taking maximization of the inference confidence level corresponding to the target inference voice mark set among the plurality of inference voice mark sets as a second direction, and obtaining the first error value based on the first error determining function according to the second direction and the inference confidence levels of the plurality of inference voice mark sets, wherein each voice object mark in the target inference voice mark set represents none of the preset voice objects.
CN202410406714.6A 2024-04-07 2024-04-07 Learning type electric toy control method based on man-machine interaction Pending CN117995174A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410406714.6A CN117995174A (en) 2024-04-07 2024-04-07 Learning type electric toy control method based on man-machine interaction

Publications (1)

Publication Number Publication Date
CN117995174A 2024-05-07

Family

ID=90901464

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination