CN111985751A - Man-machine chat experience assessment system

Info

Publication number: CN111985751A
Application number: CN201910434308.XA
Authority: CN
Inventors: 宓佳琦; 贾孟华; 韩雅娟; 陈宪涛; 周茉莉; 关岱松
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2019-05-23
Filing date: 2019-05-23
Publication date: 2020-11-24
Anticipated expiration: 2039-05-23
Also published as: CN111985751B

Abstract

The invention provides a model construction method for artificial intelligence equipment interaction capability evaluation, and belongs to the technical field of human-computer interaction evaluation. The method comprises the following steps: determining different evaluation indexes and the hierarchical relationship of the evaluation indexes, and constructing an evaluation index model related to the evaluation indexes according to the hierarchical relationship; and acquiring data of the artificial intelligence equipment on the evaluation index, and performing parameter estimation on a weight coefficient on the evaluation index model by using the data to obtain the evaluation index model with the weight coefficient value. The invention provides a human-computer interaction evaluation index model, creatively evaluates the human-computer interaction capability through the quantitative data of user experience, has better reliability and validity, reasonable structure and good internal consistency, and particularly can be used for monitoring the human-computer chat service experience.

Description

Man-machine chat experience assessment system

Technical Field

The invention relates to the technical field of human-computer interaction evaluation, in particular to a model construction method for artificial intelligence device interaction capability evaluation, a method for artificial intelligence device interaction capability evaluation, an evaluation data generation method for artificial intelligence device interaction capability evaluation, a predicted user experience scoring method, a service system for artificial intelligence device interaction capability evaluation, a device for artificial intelligence device interaction capability evaluation and a computer-readable storage medium.

Background

Human-machine chat, that is, open-domain chat and conversation without fixed answers, is the development trend of artificial intelligence voice products. It allows a machine to communicate with people fully and smoothly in context, and reflects a higher level of machine voice interaction capability. For a smart speaker, the chat experience affects the user's overall experience and impression of the product. At present, chat services for smart speakers are still in an exploratory stage, and product teams do not know what constitutes a good chat experience. The industry currently evaluates voice conversations mainly through objective behavioral data (such as the number of conversation turns and the duration), or by evaluating only the chat dialogue itself with dimensions that are too coarse (such as whether a reply is reasonable). A complete and detailed human-machine chat experience evaluation system based on the user experience perspective is lacking. There is likewise no quantitative standard, from the user's perspective, for a good human-machine chat experience, so product teams find it difficult to improve the product experience objectively on the basis of existing research and cannot quantitatively measure whether an improved product meets user needs. Moreover, from users' overall evaluation of the chat experience of the current product alone, a product team cannot clearly locate the causes of problems or continuously track the chat experience. It is therefore necessary to develop a human-machine chat experience evaluation index system based on the smart speaker, so as to: (1) help product teams determine the direction and focus of the human-machine chat service; (2) help product teams continuously monitor the chat experience of the product; and (3) establish an industry standard for human-machine chat experience.

Disclosure of Invention

The embodiment of the invention aims to provide a model construction method for evaluating the interaction capacity of artificial intelligent equipment and a method for evaluating the interaction capacity of artificial intelligent equipment.

In order to achieve the above object, an embodiment of the present invention provides a model building method for evaluating interaction capability of an artificial intelligence device, where the model building method includes:

S1) determining different evaluation indexes and the hierarchical relationship of the evaluation indexes, and constructing an evaluation index model of the evaluation indexes according to the hierarchical relationship;

S2) obtaining data of the artificial intelligence device about the evaluation indexes, and using the data to perform parameter estimation of weight coefficients on the evaluation index model, to obtain the evaluation index model with weight coefficient values.

Specifically, step S1) includes:

S101) determining different evaluation indexes, and acquiring a first evaluation data set of the artificial intelligence device about the evaluation indexes, wherein the first evaluation data set has different sub data sets, each corresponding to one evaluation index;

S102) obtaining the variation characteristics of the data in each sub data set of the first evaluation data set and the correlation characteristics of every two sub data sets, screening the evaluation indexes according to the variation characteristics and the correlation characteristics, obtaining optimized evaluation indexes and the hierarchical relationship of the optimized evaluation indexes, and forming an evaluation index model of the optimized evaluation indexes with a hierarchical structure.

Specifically, the step S102) of obtaining the variation characteristic of the data in each sub data set in the first evaluation data set and the related characteristic of each two sub data sets includes:

S121) obtaining the variation characteristics of the data in each sub data set of the first evaluation data set from the first evaluation data set in combination with a Kano model;

S122) obtaining the correlation characteristics of every two sub data sets of the first evaluation data set from the first evaluation data set in combination with correlation analysis and/or regression analysis.

Specifically, obtaining the hierarchical relationship of the optimized evaluation indexes in step S102) includes:

processing the optimized evaluation indexes and the first evaluation data set by a reliability check method, and then performing exploratory factor analysis on the processed optimized evaluation indexes and the processed evaluation data set to obtain the hierarchical relationship of the optimized evaluation indexes.

Specifically, step S2) includes:

S201) obtaining a second evaluation data set related to the evaluation index model, and performing parameter estimation of weight coefficients on the evaluation index model using the second evaluation data set in combination with a preset model, to obtain the weight coefficient of each optimized evaluation index within the optimized evaluation index of the previous level in the evaluation index model, wherein the second evaluation data set has different sub data sets, each sub data set corresponding to one optimized evaluation index in the last level of the evaluation index model;

S202) updating the evaluation index model with the weight coefficients to obtain an evaluation index model of optimized evaluation indexes that has a hierarchical structure and weight coefficient values.

Specifically, the performing, in step S201), parameter estimation on a weight coefficient of the evaluation index model by using the second evaluation data set in combination with a preset model includes:

taking the data in the sub data sets of the second evaluation data set as directly observed variables, and performing free parameter estimation on the evaluation index model in combination with a structural equation model.

Specifically, in step S201), after performing parameter estimation on the evaluation index model by using the second evaluation data set in combination with a preset model and before obtaining the weight coefficient of each optimized evaluation index in the optimized evaluation indexes of the previous hierarchy in the evaluation index model, the method further includes:

performing a fitness check on the evaluation index model using a structural equation model; when the evaluation index model fails the fitness check, returning to perform the parameter estimation of the weight coefficients again, and when the evaluation index model passes the fitness check, taking the estimated weight coefficient values as the weight coefficient values of the evaluation index model.

Specifically, the step S2) further includes, after acquiring data of the artificial intelligence device about the evaluation index and before performing parameter estimation about a weight coefficient on the evaluation index model using the data: and processing the data by a data cleaning method and a reliability verification method.

The embodiment of the invention provides a method for evaluating interaction capacity of artificial intelligence equipment, which comprises the following steps:

S1) obtaining current evaluation data of each evaluation index in the last level of the evaluation index model having a hierarchical structure and weight coefficient values;

S2) substituting the current evaluation data into the evaluation index model and calculating, in combination with the weight coefficient values, an evaluation standard score;

S3) obtaining an evaluation collection score corresponding to the evaluation index at the first (topmost) level of the evaluation index model, and taking the evaluation standard score as the true score of the interaction capability of the artificial intelligence device when the correlation coefficient between the evaluation collection score and the evaluation standard score meets a threshold condition.
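The acceptance check in step S3) can be illustrated with a minimal sketch. It assumes the correlation coefficient is the Pearson coefficient and that the threshold condition is "at least 0.8"; the text does not state either, and the function name and sample scores are hypothetical.

```python
import numpy as np

def accept_standard_scores(collected, standard, threshold=0.8):
    """Return (r, accepted) for two score series of equal length.

    r is the Pearson correlation between directly collected top-level
    scores and model-computed standard scores. The threshold value is an
    assumption made for illustration only.
    """
    collected = np.asarray(collected, dtype=float)
    standard = np.asarray(standard, dtype=float)
    r = float(np.corrcoef(collected, standard)[0, 1])
    return r, r >= threshold

# Hypothetical scores from five users.
collected = [8, 6, 9, 4, 7]            # overall satisfaction given by users
standard = [7.9, 6.2, 8.8, 4.5, 7.1]   # scores computed by the model
r, accepted = accept_standard_scores(collected, standard)
```

If `accepted` is false, the standard score would not be treated as the true score under this reading of step S3).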

Specifically, when the evaluation index model is a three-level evaluation index model, the three-level evaluation index model is:

where S_X is the evaluation standard score; a_i is the weight coefficient value of the i-th second-level evaluation index within the first-level evaluation index; b_ij is the weight coefficient value of the j-th third-level evaluation index within the i-th second-level evaluation index; F_ij is the current evaluation data of each evaluation index among the third-level evaluation indexes; W_sum is the normalized sum of weights; and M and N are positive integers.
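The formula itself is not reproduced in this text, so the following is only a sketch of one form that is consistent with the symbol definitions: a weighted sum of the third-level data F_ij with weights a_i * b_ij, divided by the sum of those weights (taken here as W_sum). That form, and the function name, are assumptions and not the patent's stated formula.

```python
def standard_score(a, b, F):
    """Weighted three-level score under an ASSUMED formula.

    a[i]    : weight of second-level index i within the first-level index
    b[i][j] : weight of third-level index j within second-level index i
    F[i][j] : current evaluation data for third-level index (i, j)
    """
    weighted = sum(
        a_i * b_ij * f_ij
        for a_i, b_i, f_i in zip(a, b, F)
        for b_ij, f_ij in zip(b_i, f_i)
    )
    # Assumed meaning of W_sum: the total of all combined weights.
    w_sum = sum(a_i * b_ij for a_i, b_i in zip(a, b) for b_ij in b_i)
    return weighted / w_sum

# Hypothetical weights and user ratings (two second-level indexes).
a = [0.6, 0.4]
b = [[0.5, 0.5], [1.0]]
F = [[8, 6], [10]]
score = standard_score(a, b, F)
```

Dividing by the weight total keeps the result on the same scale as the input ratings, which is why identical ratings on every index return that same rating.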

The embodiment of the invention provides a migration method for an artificial intelligence equipment interaction capability evaluation index model, which comprises the following steps:

S1) obtaining current evaluation data of a current user group corresponding to the evaluation indexes of a source evaluation index model;

S2) updating the evaluation indexes and hierarchical relationship of a target evaluation index model according to the evaluation indexes and hierarchical relationship of the source evaluation index model to obtain an updated target evaluation index model, and determining the weight coefficients of each level of the updated target evaluation index model according to the current evaluation data.

The embodiment of the invention provides an evaluation data generation method for artificial intelligence equipment interaction capability evaluation, which comprises the following steps:

S1) acquiring logs of the interaction behavior between the artificial intelligence device and users;

S2) selecting observed users, and forming a big data set from the logs corresponding to all the observed users;

S3) determining evaluation indexes for direct observation, and generating evaluation data corresponding to each evaluation index by judging the relationship between the big data set and a preset rule set corresponding to that evaluation index, in combination with a preset scoring rule.

The embodiment of the invention provides a method for predicting user experience scores, which comprises the following steps:

S1) labeling the correspondence between the scores given by an observed user group for each evaluation index in a source evaluation index model and the log data of the interaction behavior between the artificial intelligence device and the observed user group, and acquiring training evaluation data having that correspondence;

S2) training the source evaluation index model by a transfer learning method in combination with the training evaluation data, and obtaining a target evaluation index model after training is completed;

S3) obtaining current log data of the interaction behavior between the artificial intelligence device and an observed current user or observed current user group, and predicting the score of the observed current user or observed current user group using the target evaluation index model in combination with the current log data.

Specifically, in step S1), the correspondence is obtained by:

firstly, setting a mapping rule and a scoring rule corresponding to each evaluation index, and selecting the current evaluation index;

secondly, obtaining the degree of the log data meeting the mapping rule corresponding to the current evaluation index, and determining the score of the log data according to the degree and the scoring rule corresponding to the current evaluation index.

The embodiment of the invention provides a service system for evaluating the interaction capacity of artificial intelligence equipment, which comprises:

a computing device for performing calculation with the evaluation index model having a hierarchical structure and weight coefficient values, and for generating a data table containing each evaluation index in the last level of the evaluation index model;

an acquisition device for presenting the data table to a user terminal, sending the data table fed back by the user terminal to the computing device, and acquiring an evaluation collection score corresponding to the evaluation index at the first (topmost) level of the evaluation index model;

the computing device being further used for taking the evaluation standard score as the true score of the interaction capability of the artificial intelligence device when the correlation coefficient between the evaluation collection score and the evaluation standard score meets a threshold condition.

In another aspect, an embodiment of the present invention provides an apparatus for evaluating interaction capability of an artificial intelligence apparatus, including:

at least one processor;

a memory coupled to the at least one processor;

wherein the memory stores instructions executable by the at least one processor, the at least one processor implements the aforementioned method by executing the instructions stored by the memory.

In yet another aspect, an embodiment of the present invention provides a computer-readable storage medium storing computer instructions, which, when executed on a computer, cause the computer to perform the foregoing method.

The invention provides a human-computer interaction evaluation index model, which creatively evaluates human-computer interaction capacity through quantitative data of user experience, has better reliability and validity, reasonable structure and good internal consistency, and can be particularly used for monitoring human-computer chat service experience;

the evaluation index model constructed by the invention has stability on evaluation index dimension and evaluation index structure levels, can calculate the man-machine chat experience standard score of a product by collecting a certain amount of satisfaction degree and evaluation index data of a specific user group and combining the weight coefficient of the model, and the standard score of the model can be used as a reference for optimizing product experience and can also be used as a basis for judging whether the requirements of the specific user group are met.

The invention also creatively enables prediction of user interaction experience scores: with the evaluation index model serving as the source evaluation index model, the correspondence between log data reflecting users' interaction behavior and the scores collected through the evaluation index model is used in reverse, so that the score an observed user or user group would give to the artificial intelligence device they have experienced is inferred from the log data.

Additional features and advantages of embodiments of the invention will be set forth in the detailed description which follows.

Drawings

The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the embodiments of the invention without limiting the embodiments of the invention. In the drawings:

FIG. 1 is a schematic flow chart of a main method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of KANO model attribute classification according to the embodiment of the present invention;

FIG. 3 is a diagram illustrating correlation coefficients of evaluation indicators according to an embodiment of the present invention;

FIG. 4 is a diagram illustrating multicollinearity coefficients of the evaluation indexes according to an embodiment of the present invention;

FIG. 5 is a diagram illustrating principal component analysis results according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of the factor matrix obtained by rotating the principal component analysis results according to an embodiment of the present invention;

FIG. 7 is a diagram illustrating two factors obtained from exploratory factor analysis according to an embodiment of the present invention;

FIG. 8 is a diagram illustrating an evaluation index hierarchy of model A according to an embodiment of the present invention;

FIG. 9 is a free parameter estimation diagram of model A according to an embodiment of the present invention;

FIG. 10 is a diagram illustrating an evaluation index hierarchy of model B according to an embodiment of the present invention;

FIG. 11 is a free parameter estimation diagram of model B according to an embodiment of the present invention;

FIG. 12 is a diagram illustrating a basic fitness test and an intrinsic structural fitness test of a model A according to an embodiment of the present invention;

FIG. 13 is a diagram illustrating an overall model fitness test of model A according to an embodiment of the present invention;

FIG. 14 is an evaluation index model with a hierarchical structure and weight coefficient values according to an embodiment of the present invention;

FIG. 15 is an evaluation list having a hierarchical structure and weight coefficient values according to an embodiment of the present invention.

Detailed Description

The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating embodiments of the invention, are given by way of illustration and explanation only, not limitation.

Example 1

A model construction method for artificial intelligence device interaction capability assessment, the model construction method comprising:

S2) obtaining data of the artificial intelligence device about the evaluation indexes, and using the data to perform parameter estimation of weight coefficients on the evaluation index model, to obtain an evaluation index model with weight coefficient values, as shown in FIG. 1;

Here the artificial intelligence device is taken to be a smart speaker, and the interaction capability is the voice interaction capability.

One way to acquire data may be an evaluation data generation method for artificial intelligence device interaction capability evaluation, the evaluation data generation method including:

S3) determining evaluation indexes for direct observation, and generating evaluation data corresponding to each evaluation index by judging the relationship between the big data set and a preset rule set corresponding to that evaluation index, in combination with a preset scoring rule;

The observed users may be selected according to user age. Each preset rule set corresponds to one evaluation index. For example, if ten preset rules exist and the data in the big data set conform to all ten, the score of that evaluation index is ten, and the evaluation index corresponds to one piece of evaluation data; here the preset scoring rule awards one point for each preset rule that is satisfied. For example, taking the evaluation index to be "humorous and interesting", one of the preset rules may be that "laughter exists" in a large number of user logs (laughter may be detected by audio analysis, or by natural language processing after speech-to-text conversion, to determine whether laughter is present in the logs recorded by the artificial intelligence device), in which case one point is scored.
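The one-point-per-rule scoring described above can be sketched as follows. The log format, the rule predicates and the function name are hypothetical; in particular, laughter detection is reduced to a text match on an already transcribed log, which stands in for the audio analysis or natural language processing mentioned in the text.

```python
from typing import Callable, Iterable, List

# A rule inspects the whole log set and reports whether it is satisfied.
Rule = Callable[[List[str]], bool]

def score_index(logs: List[str], rules: Iterable[Rule]) -> int:
    """Award one point for every preset rule the logs satisfy."""
    return sum(1 for rule in rules if rule(logs))

# Hypothetical rule set for the "humorous and interesting" index.
humor_rules = [
    lambda logs: any("haha" in line.lower() for line in logs),
    lambda logs: any("[laughter]" in line.lower() for line in logs),
    lambda logs: any("tell me another joke" in line.lower() for line in logs),
]

logs = [
    "user: tell me a joke",
    "device: why did the speaker cross the road?",
    "user: [laughter] haha, good one",
]
humor_score = score_index(logs, humor_rules)
```

Keeping each rule as an independent predicate means a rule set of any size maps directly to a score range of the same size, matching the ten-rule, ten-point example.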

As one way of acquiring the data, specifically, the step S1) includes:

S102) obtaining the variation characteristics of the data in each sub data set of the first evaluation data set and the correlation characteristics of every two sub data sets, screening the evaluation indexes according to the variation characteristics and the correlation characteristics, obtaining optimized evaluation indexes and the hierarchical relationship of the optimized evaluation indexes, and forming an evaluation index model of the optimized evaluation indexes with a hierarchical structure;

The first evaluation data set or other evaluation data can embody the experience of the user on the current artificial intelligence machine and the potential demand direction of the artificial intelligence machine.

Specifically, for determining different evaluation indexes in step S101), the different evaluation indexes may be determined by:

first, comprehensive collection and framework building: evaluation indexes for measuring the chat experience of a screenless speaker are obtained from interdisciplinary theory (such as phonology, syntax, semantics and pragmatics), machine voice research, and chat experience evaluation methods for intelligent products, where each evaluation index concerns an element of the interaction experience that users can perceive and on which users can directly provide scoring data;

then, screening and adjustment: the evaluation indexes are screened according to the technical characteristics of the product (an artificial intelligence machine such as a smart speaker), for example implementation cost, the development status and trend of the product, and factors such as how easily users can perceive and understand each evaluation index;

then, users may be surveyed by sampling on how well they understand the evaluation indexes (scored from 1 to 10); for example, an evaluation index that cannot be understood (such as "the conversation is rhythmic") is scored 1. The evaluation indexes are thereby screened and optimized again, as shown in Tables 1 and 2;

Table 1 Element aspects and evaluation indexes

Table 2 Evaluation indexes and corresponding descriptions

S122) obtaining the correlation characteristics of every two sub data sets of the first evaluation data set from the first evaluation data set in combination with correlation analysis and/or regression analysis;

for the first evaluation data, the overall satisfaction with the chat experience is taken as the first-level evaluation index and the evaluation indexes in the first column of Table 2 are taken as third-level evaluation indexes to form an evaluation list; after a certain number of users have experienced the chat function of the smart speaker, the first evaluation data is obtained by having them score the evaluation list, as shown in Table 3;

TABLE 3 evaluation List

Attributes are classified by combining the Kano model with the evaluation data. They are first divided into four categories: essential attributes, expected attributes, attractive attributes and indifferent attributes. Essential attribute: when the attribute does not meet the user's needs the user is dissatisfied, but when it does meet them the user does not feel especially satisfied. Expected attribute: user satisfaction is proportional to the degree to which the attribute meets the user's needs. Attractive attribute: when the attribute does not meet the user's needs the user does not feel dissatisfied, but when it does the user feels very satisfied. Indifferent attribute: whether or not the attribute is satisfied does not affect the user experience. The priority of the attribute categories is then determined, for example essential attribute > expected attribute > attractive attribute > indifferent attribute. Satisfaction can be quantified by the Better coefficient (vertical axis of FIG. 2) and the Worse coefficient (horizontal axis of FIG. 2). The Better coefficient represents the satisfaction coefficient after a given evaluation index is added: a value above 50% indicates that satisfaction increases markedly once the index is added, and a value below 50% indicates that satisfaction does not change markedly. The Worse coefficient represents the absolute value of the satisfaction coefficient after a given evaluation index is removed: a value above 50% indicates that satisfaction decreases markedly once the index is removed, and a value below 50% indicates that satisfaction does not change markedly. The indifferent-attribute indexes among the evaluation indexes can thus be identified and deleted from the evaluation list. Examples of evaluation indexes belonging to the indifferent attributes are "content is plain and easy to understand" (all current smart speakers share the same limitation, so the evaluation data show no difference), "speaking rate is moderate" (too strongly tied to the individual user, so no difference at the group level) and "state feedback is timely and effective" (too strongly tied to the individual user's state, for example being at work, so no difference at the group level). In FIG. 2, the score is the stated importance on a scale of 1 to 10, where 1 means very unimportant and 10 means very important, and the abscissa is the absolute value of the Worse coefficient;
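The text does not give the formulas for the two coefficients. The sketch below uses the commonly cited Kano definitions, Better = (A + O) / (A + O + M + I) and Worse = -(O + M) / (A + O + M + I), where A, O, M and I are the numbers of respondents whose answers classify an index as attractive, expected, essential and indifferent. Those formulas, the quadrant mapping and the function names are assumptions made for illustration.

```python
def kano_coefficients(attractive, expected, essential, indifferent):
    """Return (better, worse) from Kano category counts for one index."""
    total = attractive + expected + essential + indifferent
    better = (attractive + expected) / total
    worse = -(expected + essential) / total
    return better, worse

def kano_category(better, worse, cut=0.5):
    """Map the two coefficients to a category using the 50% cut-off."""
    high_better = better > cut
    high_worse = abs(worse) > cut
    if high_better and high_worse:
        return "expected"
    if high_better:
        return "attractive"
    if high_worse:
        return "essential"
    return "indifferent"

# Hypothetical respondent counts for two evaluation indexes.
better_1, worse_1 = kano_coefficients(10, 15, 60, 15)
better_2, worse_2 = kano_coefficients(5, 10, 10, 75)
category_1 = kano_category(better_1, worse_1)
category_2 = kano_category(better_2, worse_2)
```

Under this sketch, an index landing in the "indifferent" quadrant is the kind that would be removed from the evaluation list.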

The correlation coefficient between every two evaluation indexes is calculated, and evaluation indexes with a correlation coefficient greater than 0.8 are marked; a multicollinearity check is also applied among the evaluation indexes, and evaluation indexes with a tolerance value less than 0.2 are marked. The marked evaluation indexes in the evaluation list are then merged; for example, "leaving the talk" and "opening a new topic" are merged into "opening a new topic", and "content relevance" and "understanding" are merged into "understanding and answer relevance", as shown in FIG. 3 and FIG. 4. The optimized evaluation indexes are shown in Table 4.
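A minimal sketch of the two redundancy checks, using the thresholds from the text (correlation above 0.8, tolerance below 0.2). Tolerance is computed here as the reciprocal of the variance inflation factor, taken from the diagonal of the inverse correlation matrix; the text does not say how tolerance was computed, so that choice, the function name and the synthetic data are assumptions.

```python
import numpy as np

def flag_redundant(scores, r_cut=0.8, tol_cut=0.2):
    """scores: array of shape (n_users, n_indexes).

    Returns (pairs, low_tol): index pairs whose correlation exceeds
    r_cut, and indexes whose tolerance falls below tol_cut.
    """
    corr = np.corrcoef(scores, rowvar=False)
    n = corr.shape[0]
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)
             if corr[i, j] > r_cut]
    # tolerance_i = 1 - R_i^2 = 1 / VIF_i, and VIF_i is the i-th
    # diagonal entry of the inverse correlation matrix.
    tolerance = 1.0 / np.diag(np.linalg.inv(corr))
    low_tol = [i for i in range(n) if tolerance[i] < tol_cut]
    return pairs, low_tol

# Synthetic ratings: index 1 nearly duplicates index 0, index 2 is unrelated.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
scores = np.column_stack([
    base,
    base + 0.1 * rng.normal(size=200),
    rng.normal(size=200),
])
pairs, low_tol = flag_redundant(scores)
```

Flagged indexes are candidates for merging; which index of a pair survives is a separate judgment that the text makes on qualitative grounds.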

Table 4 evaluation list with optimized evaluation index

the optimized evaluation indexes and the first evaluation data set are processed by a reliability check method, and exploratory factor analysis is then performed on the processed optimized evaluation indexes and the processed evaluation data set to obtain the hierarchical relationship of the optimized evaluation indexes;

Before exploratory factor analysis is carried out, the evaluation data of the user experience scores in the evaluation list are preprocessed and checked. Cronbach's coefficient (Cronbach's α, a statistic) is used as the measure for reliability analysis; for example, with the first threshold set to 0.8, the α of the data in the evaluation list is 0.954 (greater than 0.8), indicating that the reliability of the evaluation indexes corresponding to the evaluation data is very good. Structural validity analysis is then performed: the evaluation data of the user experience scores in the evaluation list are processed and checked by the KMO (Kaiser-Meyer-Olkin) test, with the KMO statistic as the measure of structural validity, and by Bartlett's test (Sig. statistic). For example, with the second threshold set to 0.9 and the third threshold set to 0.05, the KMO of the data in the evaluation list is 0.928 and Sig. < 0.05, indicating that the evaluation data and the evaluation indexes are well suited to exploratory factor analysis;
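The reliability measure can be sketched with the standard formula for Cronbach's α, α = k / (k - 1) * (1 - Σ item variances / variance of the total score), where k is the number of evaluation indexes. The function name and the sample ratings are hypothetical; the KMO and Bartlett tests are not covered here.

```python
import numpy as np

def cronbach_alpha(scores):
    """scores: array of shape (n_users, n_indexes) of user ratings."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_variance_sum = scores.var(axis=0, ddof=1).sum()
    total_variance = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variance_sum / total_variance)

# Hypothetical ratings in which every user rates all indexes identically,
# so the indexes are perfectly consistent with each other.
ratings = [[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]]
alpha = cronbach_alpha(ratings)
reliable = alpha > 0.8  # first threshold from the text
```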

In the exploratory factor analysis, principal component analysis is first carried out on the evaluation data and the evaluation indexes, and the resulting factors are then rotated by the varimax (maximum variance) method. (For merged evaluation indexes, the index with the larger satisfaction coefficient is selected for the exploratory factor analysis; for modified evaluation indexes, the evaluation data from before the modification are used.) Factors with eigenvalues greater than 1 are extracted; two factors are finally extracted, giving a cumulative variance contribution rate of 68.043%, as shown for example in FIG. 5 and FIG. 6. The result of the factor analysis can be seen more directly in FIG. 6, where factor loadings below 0.4 are left blank; the matrix in FIG. 6 is the factor matrix obtained by rotating the analysis result values of FIG. 5 using varimax rotation with Kaiser normalization;
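The extraction rule (keep factors whose eigenvalue exceeds 1, then report the cumulative variance contribution) can be sketched from the eigenvalues of the correlation matrix. The sketch covers only the extraction step, not the varimax rotation, and the function name and synthetic data are assumptions.

```python
import numpy as np

def kaiser_factors(scores):
    """scores: array of shape (n_users, n_indexes).

    Returns (n_factors, cumulative): the number of eigenvalues of the
    correlation matrix greater than 1, and the share of total variance
    those factors explain.
    """
    corr = np.corrcoef(scores, rowvar=False)
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # descending order
    n_factors = int((eigenvalues > 1.0).sum())
    cumulative = float(eigenvalues[:n_factors].sum() / eigenvalues.sum())
    return n_factors, cumulative

# Synthetic ratings driven by two independent latent factors,
# three evaluation indexes per factor.
rng = np.random.default_rng(1)
f1, f2 = rng.normal(size=(2, 500))
noise = 0.3 * rng.normal(size=(500, 6))
scores = np.column_stack([f1, f1, f1, f2, f2, f2]) + noise
n_factors, cumulative = kaiser_factors(scores)
```

On real survey data the cumulative share would typically be lower than on this clean synthetic example; the text reports 68.043% for two factors.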

Observing the third-level indexes (factor loading greater than 0.4) mainly contained in the two factors shows that:

1. the first factor mainly comprises evaluation indexes related to understanding, conversation continuity, content quality and content expression mode;

2. the second factor mainly comprises evaluation indexes related to sound and content expression mode;

By defining the qualitative scope of the evaluation indexes related to content expression mode (such as "friendly and approachable", "humorous and interesting", "persona" and "natural and fluent expression") and adjusting the structure of some factors in light of semantic and speech processing experience, the "evaluation indexes related to content expression mode" are assigned to the second factor, that is, they are deleted from the first factor, giving the form shown in FIG. 7.

The evaluation indexes are analyzed by exploratory factor analysis, and the factor analysis results are optimized again in combination with qualitative experience, forming a four-factor evaluation model (as shown in FIG. 8). Combining the hierarchical relationship of the existing evaluation indexes within each factor yields two evaluation index models with hierarchical structures to be verified, namely model A (four second-level factors, shown in FIG. 8) and model B (two second-level factors, shown in FIG. 10);

Model A comprises: (1) the first-level evaluation index (overall satisfaction with the chat experience); (2) the second-level evaluation indexes under the overall satisfaction with the chat experience ("understanding and conversation continuity", "content quality", "content expression mode" and "sound"); and (3) the third-level evaluation indexes, listed here by the second-level index to which each corresponds: "understanding answer correlation (understanding)", "inconsistent answering (logic)", "contact context" and "open new topic" correspond to "understanding and conversation continuity"; "real and reliable content", "rich content" and "valuable content" correspond to "content quality"; "natural and fluent expression", "diversified expression", "affinity friendly", "humorous interest", "human setting" and "appeasing motivation (situational conversation)" correspond to "content expression mode"; and "natural and human-like voice" and "good timbre" correspond to "sound". Step S2) is then performed to estimate the free parameters, as shown in fig. 9, where r and e are parameter terms and the numbers beside the arrows are standardized path coefficients;

Model B comprises: (1) the first-level evaluation index (overall satisfaction with the chat experience); (2) the second-level evaluation indexes under the overall satisfaction with the chat experience ("understanding and conversation continuity and content quality" and "content expression mode and sound"); and (3) the third-level evaluation indexes, listed here by the second-level index to which each corresponds: "understanding answer correlation (understanding)", "inconsistent answering (logic)", "contact context", "open new topic", "real and reliable content", "rich content" and "valuable content" correspond to "understanding and conversation continuity and content quality"; and "natural and fluent expression", "diversified expression", "affinity friendly", "humorous interest", "human setting", "appeasing motivation (situational conversation)", "natural and human-like voice" and "good timbre" correspond to "content expression mode and sound". Step S2) is then performed to estimate the free parameters, as shown in fig. 11, where r and e are parameter terms and the numbers beside the arrows are standardized path coefficients.

Specifically, step S2) includes:

Specifically, the step S2) further includes, after acquiring data of the artificial intelligence device about the evaluation index and before performing parameter estimation about a weight coefficient on the evaluation index model using the data: processing the data by a data cleaning method and a reliability verification method;

A second evaluation data set is acquired using the evaluation list with the optimized evaluation indexes. Using a second-order structural equation model together with the second evaluation data set, free parameter estimation is performed on model A and model B to obtain their respective standardized path coefficients, and a goodness-of-fit test is performed on each model. Specifically, whether a model is meaningful is judged from its free parameter estimates: if a standardized path coefficient is greater than 1, the corresponding model is discarded; the meaningful models then undergo the goodness-of-fit test. With the data collected in this embodiment (sex ratio of 1:1, ages from 20 to 40, and a 1:1 ratio of sample users with more and with less smart speaker experience), a standardized path coefficient of model B is greater than 1, so model A is selected as the final evaluation index model. In this model, each weight coefficient value equals the square of the corresponding standardized path coefficient, as shown in fig. 14. For example, the standardized path coefficient between "understanding and conversation continuity" and "understanding answer correlation" is 0.75, so the corresponding weight coefficient is 0.75² (about 0.56). Free parameter estimation uses generalized least squares estimation (GLS estimation); the free parameters may include: the standardized factor loadings, the standardized path coefficients, the variances of the measurement error variables, the squares of the standardized factor loadings, the variances of the residual terms of the endogenous latent variables, and the variances of the exogenous latent variables. The second-order structural equation model is used because the second-level evaluation indexes covary strongly with one another (for example, a correlation coefficient greater than a preset threshold is regarded as strong covariation). As shown in figs. 12 and 13, the fit can be judged comprehensively from three aspects: the basic fit test, the internal fit test, and the overall model fit (including absolute fit statistics, incremental fit statistics and parsimonious fit statistics), where the chi-square value, RMSEA, GFI, AGFI, ECVI and NCP are absolute fit statistics; NFI, RFI, IFI, TLI and CFI are incremental fit statistics; and PGFI, PNFI, the chi-square to degrees-of-freedom ratio, AIC and CAIC are parsimonious fit statistics.
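The model screening rule and the conversion from standardized (normalized) path coefficients to weight coefficient values can be illustrated with a short sketch. Only the 0.75 coefficient comes from the text; the function names and the out-of-range value 1.03 are hypothetical placeholders, since the actual coefficients of model B are not given here.

```python
def is_meaningful(path_coefficients):
    """A model is kept only if no standardized path coefficient exceeds 1."""
    return all(abs(c) <= 1.0 for c in path_coefficients.values())

def weights_from_paths(path_coefficients):
    """Weight coefficient value = square of the standardized path coefficient."""
    return {name: c ** 2 for name, c in path_coefficients.items()}

# 0.75 is the coefficient quoted in the text; 1.03 is a hypothetical example
# of an inadmissible estimate.
model_a = {"understanding answer correlation": 0.75}
model_b = {"hypothetical path": 1.03}

print(is_meaningful(model_a), is_meaningful(model_b))
for name, weight in weights_from_paths(model_a).items():
    print(f"{name}: {weight:.2f}")
```

Squaring a standardized coefficient gives the share of variance in the lower-level index that is explained by the higher-level factor, which is why the result can be read as a weight between 0 and 1.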

After the weight coefficient values of model A are obtained, the overall satisfaction (namely the evaluation standard score) is computed by combining them with the evaluation data. A correlation coefficient is then calculated between this computed overall satisfaction and the overall score that the user group gave directly for overall chat satisfaction (namely the evaluation acquisition score). For model A the correlation coefficient is 0.85, which reaches the level of extremely strong correlation. A correlation coefficient greater than 0.8 indicates that the evaluation index model predicts the evaluation acquisition score well, as shown in fig. 14. The evaluation list is updated as shown in fig. 15; the evaluation list, together with the process of arriving at it, may be referred to as a "human-machine chat experience evaluation index system".
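The validation check described above can be sketched as a Pearson correlation followed by a threshold comparison. The 0.8 threshold is the one stated in the text; the function name and the sample score lists are illustrative assumptions and not data from this embodiment.

```python
import numpy as np

def predicts_well(standard_scores, acquisition_scores, threshold=0.8):
    """Return the Pearson correlation and whether it exceeds the threshold."""
    r = float(np.corrcoef(standard_scores, acquisition_scores)[0, 1])
    return r, r > threshold

# Hypothetical per-user scores: model-computed versus directly collected.
computed = [3.1, 4.2, 2.5, 4.8, 3.9]
collected = [3.0, 4.5, 2.0, 5.0, 4.0]
r, ok = predicts_well(computed, collected)
print(f"r = {r:.2f}, predicts well: {ok}")
```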

Example 2

Based on embodiment 1, the evaluation index model in embodiment 1 may be selected as the evaluation index model here, and used in a method for evaluating interaction capability of artificial intelligence equipment, where the method includes:

wherein S_X is the evaluation standard score; a_i is the weight coefficient value of the i-th second-level evaluation index within the first-level evaluation index; b_ij is the weight coefficient value of the j-th third-level evaluation index within the i-th second-level evaluation index; F_ij is the current evaluation data of each third-level evaluation index; W_sum is the normalized sum of weights; and M and N are positive integers. In this embodiment, M is 4 and N ∈ {2, 3, 4, 6}. As in fig. 14, the weight coefficient values of the evaluation index model are as follows. Second-level evaluation indexes within the first-level evaluation index (overall chat experience satisfaction): "understanding and conversation continuity" (0.78), "content quality" (0.70), "content expression mode" (0.95) and "sound" (0.53). Third-level evaluation indexes within their second-level evaluation indexes: for "understanding and conversation continuity", "understanding answer correlation [understanding]" (0.56), "inconsistent answering [logic]" (0.50), "contact context" (0.41) and "open new topic" (0.64); for "content quality", "real and reliable content" (0.20), "rich content" (0.47) and "valuable content" (0.64); for "content expression mode", "natural and fluent expression" (0.54), "diversified expression" (0.41), "affinity friendly" (0.45), "humorous interest" (0.78), "human setting" (0.70) and "appeasing motivation [situational conversation]" (0.66); and for "sound", "natural and human-like voice" (0.49) and "good timbre" (0.57).
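The scoring formula itself is not reproduced in this text, so the following sketch rests on an assumed form that is consistent with the symbol definitions: S_X = (Σ_i a_i Σ_j b_ij F_ij) / W_sum, with W_sum taken as Σ_i Σ_j a_i b_ij. Both the formula and the function name `standard_score` are assumptions; the weight values are the ones listed for fig. 14.

```python
# Second-level weights a_i (values listed for fig. 14).
SECOND_LEVEL = {
    "understanding and conversation continuity": 0.78,
    "content quality": 0.70,
    "content expression mode": 0.95,
    "sound": 0.53,
}

# Third-level weights b_ij, grouped by second-level index.
THIRD_LEVEL = {
    "understanding and conversation continuity": {
        "understanding answer correlation": 0.56,
        "inconsistent answering": 0.50,
        "contact context": 0.41,
        "open new topic": 0.64,
    },
    "content quality": {
        "real and reliable content": 0.20,
        "rich content": 0.47,
        "valuable content": 0.64,
    },
    "content expression mode": {
        "natural and fluent expression": 0.54,
        "diversified expression": 0.41,
        "affinity friendly": 0.45,
        "humorous interest": 0.78,
        "human setting": 0.70,
        "appeasing motivation": 0.66,
    },
    "sound": {
        "natural and human-like voice": 0.49,
        "good timbre": 0.57,
    },
}

def standard_score(ratings):
    """Assumed form: weighted sum of third-level ratings divided by W_sum."""
    weighted, w_sum = 0.0, 0.0
    for group, a in SECOND_LEVEL.items():
        for index, b in THIRD_LEVEL[group].items():
            weighted += a * b * ratings[index]
            w_sum += a * b          # assumption: W_sum is the sum of a_i * b_ij
    return weighted / w_sum

# Hypothetical ratings F_ij: every third-level index rated 4 out of 5.
ratings = {index: 4.0 for group in THIRD_LEVEL.values() for index in group}
print(standard_score(ratings))
```

Under this assumed normalization the score stays on the same scale as the ratings, so a device rated 4 on every third-level index receives an overall score of 4.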

Example 3

Based on embodiment 1, the method for predicting user experience scores comprises the following steps:

Specifically, in step S1), the correspondence is obtained by:

secondly, obtaining the degree to which the log data satisfy the mapping rule corresponding to the current evaluation index, and determining the score of the log data according to that degree and the scoring rule corresponding to the current evaluation index;

Based on embodiment 1 and embodiment 2, the observed users may be selected according to user age. The mapping rule may be, for example, that when the data in the large data set satisfy all ten preset rules (a degree of ten), the score of the corresponding evaluation index is ten (scoring rule: the ratio of degree to score is 1:1); the scoring rule specifies the score obtained for each preset rule that is satisfied. For example, for the evaluation index "humorous interest", one of the preset rules may be that laughter is present in a large number of user logs, in which case the corresponding score is obtained. Whether laughter is present in the logs recorded by the artificial intelligence device may be determined by audio analysis, or by natural language processing after the audio has been converted to text;
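A minimal sketch of this rule-based scoring follows, assuming that the degree is the number of preset rules the log data satisfy and that each satisfied rule contributes one point (the 1:1 ratio in the text). The rule function, the `laughter_detected` field and the 30% share threshold are hypothetical; the text only says that laughter should be present in a large number of user logs.

```python
def laughter_rule(logs, min_share=0.3):
    """Hypothetical preset rule: laughter appears in enough of the logs."""
    flagged = sum(1 for log in logs if log.get("laughter_detected"))
    return bool(logs) and flagged / len(logs) >= min_share

def index_score(logs, rules, score_per_degree=1.0):
    """Degree = number of preset rules satisfied; score = degree * ratio."""
    degree = sum(1 for rule in rules if rule(logs))
    return degree * score_per_degree

# Hypothetical log records for the "humorous interest" index.
logs = [
    {"laughter_detected": True},
    {"laughter_detected": False},
    {"laughter_detected": True},
]
print(index_score(logs, [laughter_rule]))
```

With ten such rules in the list, a data set that satisfies all of them would reach a degree of ten and therefore a score of ten.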

During both training and use, the logs may be classified according to the evaluation indexes to obtain classified log data, and the classified log data are then normalized. A loss is set for the model: the normalized log data whose scores have been determined by labeling serve as the current input, and the loss is the difference between the labeled score and the current output score that the model produces for those normalized log data. This loss can be used to adjust the weight coefficient values and may therefore also be called the weight coefficient value loss;
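The normalization and loss steps can be sketched as below. The text does not say which normalization is used, so min-max scaling is an assumption here, as are both function names and the sample values.

```python
def min_max_normalize(values):
    """Scale values to [0, 1]; min-max scaling is an assumed choice."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def weight_coefficient_loss(labeled_score, output_score):
    """Loss as the difference between the labeled and the output score."""
    return labeled_score - output_score

# Hypothetical conversation durations (seconds) from classified logs.
durations = [20, 40, 60]
print(min_max_normalize(durations))
print(weight_coefficient_loss(labeled_score=8.0, output_score=6.5))
```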

Specifically, the source evaluation index model has a multi-level evaluation index structure. The evaluation index model with weight coefficient values can be trained on labeled evaluation data (log data: number of dialog errors, dialog duration and the like) and currently measured evaluation data. For example, for a three-level evaluation index model, the weight coefficient values of the second-level evaluation indexes are first locked while those of the third-level evaluation indexes are trained (the learning rate can be set according to the weight coefficient loss); the weight coefficient values of the third-level evaluation indexes are then locked while those of the second-level evaluation indexes are trained. Once training is complete, the migrated evaluation index model is obtained.
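The lock-and-train schedule can be illustrated with a small gradient-descent sketch. Everything beyond the alternation itself is an assumption: the model form (score = Σ_i a_i Σ_j b_ij F_ij), the mean squared error objective, the fixed learning rate, and the synthetic data. The text describes the loss only as a difference of scores and lets the learning rate depend on it, which this sketch does not attempt.

```python
import numpy as np

def predict(a, b, F):
    """Assumed model: sum over groups of a_i * (F_i @ b_i)."""
    return sum(a[i] * (F[i] @ b[i]) for i in range(len(a)))

def mse(a, b, F, y):
    return float(np.mean((predict(a, b, F) - y) ** 2))

def train_alternating(a, b, F, y, lr=0.05, steps=200, rounds=3):
    a = a.copy()
    b = [bi.copy() for bi in b]
    n = len(y)
    for _ in range(rounds):
        # Phase 1: second-level weights a locked, third-level weights b trained.
        for _ in range(steps):
            err = predict(a, b, F) - y
            for i in range(len(a)):
                b[i] -= lr * 2.0 * a[i] * (F[i].T @ err) / n
        # Phase 2: third-level weights b locked, second-level weights a trained.
        for _ in range(steps):
            err = predict(a, b, F) - y
            for i in range(len(a)):
                a[i] -= lr * 2.0 * ((F[i] @ b[i]) @ err) / n
    return a, b

# Synthetic labeled data (assumption): two second-level groups with 2 and 3
# third-level indexes, normalized inputs in [0, 1].
rng = np.random.default_rng(0)
F = [rng.random((200, 2)), rng.random((200, 3))]
true_a = np.array([0.8, 0.6])
true_b = [np.array([0.5, 0.7]), np.array([0.4, 0.6, 0.3])]
y = predict(true_a, true_b, F)

a0 = np.array([0.5, 0.5])
b0 = [np.full(2, 0.5), np.full(3, 0.5)]
a1, b1 = train_alternating(a0, b0, F, y)
print(mse(a0, b0, F, y), mse(a1, b1, F, y))
```

Locking one level at a time makes each phase an ordinary linear least-squares problem in the unlocked weights, which keeps the updates stable even though the full model is a product of two sets of weights.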

Alternatively, the source evaluation index model may be a neural network model pre-trained on a large amount of classified, normalized log data and a large set of collected scores. The labeled samples may still be normalized log data paired with the score and category of a given evaluation index. Correspondingly, the target evaluation index model may be a transfer neural network model obtained by adapting that neural network model to the labeled samples;

The classified log data may be obtained by analyzing the logs of the artificial intelligence device according to predefined rules. For example, the logs include a conversation scene log, a conversation audio analysis log, a natural language processing log of conversation audio converted to text, a conversation error count log, a conversation persistence log, a conversation duration log, and the like. Under the predefined rules, data from the conversation audio analysis log, or from the natural language processing log of conversation audio converted to text, may be used to measure the evaluation index "humorous interest" and serve as the basis for log classification; the data of these analysis logs then form labeled, classified log data.
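The classification step can be sketched as a lookup from log type to evaluation index. Only the mapping of the audio analysis log and the audio-to-text natural language processing log to "humorous interest" comes from the text; the type identifiers, the record layout and the "unclassified" bucket are hypothetical.

```python
from collections import defaultdict

# Predefined rule from the text: these two log types measure "humorous interest".
LOG_TYPE_TO_INDEX = {
    "conversation_audio_analysis": "humorous interest",
    "audio_to_text_nlp": "humorous interest",
}

def classify_logs(records):
    """Group log records by the evaluation index their log type measures."""
    classified = defaultdict(list)
    for record in records:
        index = LOG_TYPE_TO_INDEX.get(record["log_type"], "unclassified")
        classified[index].append(record)
    return dict(classified)

# Hypothetical log records.
records = [
    {"log_type": "conversation_audio_analysis", "laughter_detected": True},
    {"log_type": "audio_to_text_nlp", "text": "haha"},
    {"log_type": "conversation_duration", "seconds": 42},
]
for index, items in classify_logs(records).items():
    print(index, len(items))
```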

Example 4

Based on embodiment 1, the source evaluation index model or the target evaluation index model conforms to the structure of the evaluation index model in embodiment 1 and is used in a migration method for an artificial intelligence device interaction capability evaluation index model, the method comprising the following steps:

S2) updating the evaluation index and the hierarchical relationship of the target evaluation index model according to the evaluation index and the hierarchical relationship of the source evaluation index model to obtain an updated target evaluation index model, and determining the weight coefficient of each level of the updated target evaluation index model according to the current evaluation data;

The migration process makes it possible to quickly discover the needs of a specific user group (in the evaluation data for a product, users score according to their own needs) and to evaluate their experience. For prototype products, the evaluation index model can reveal the gaps between the prototype and the actual needs of users; obtaining evaluation data before mass production (especially from the users for whom the product is designed and to whom it is sold) completes the evaluation of the prototype and can greatly reduce the cost of changing the product later. For products already on the market, the evaluation index model can likewise quickly reveal the gaps relative to user needs, the technical directions and indexes in which the product should change, whether a change to the product has improved the user experience, and so on.

Example 5

Based on embodiment 1, the evaluation index model here conforms to the structure of the evaluation index model in embodiment 1, and is used in a service system for evaluating interaction capacity of artificial intelligence equipment, the service system comprising:

the computing device is used for taking the evaluation standard score as a real score of the interaction capacity of the artificial intelligence device when the correlation coefficient of the evaluation acquisition score and the evaluation standard score meets a threshold condition;

the computing device and the acquisition device may be arranged independently in different servers, in the same server, or in a cloud server formed by pooling a plurality of servers; in many practical cases, a storage device or a database may further be included to record or collect a plurality of evaluation index models corresponding to different user groups;

the user terminal may be an artificial intelligence device, or may be a mobile terminal having an application program interacting with the artificial intelligence device, such as a mobile phone, a tablet, a computer, and a smart watch.

Although the embodiments of the present invention have been described in detail with reference to the accompanying drawings, the embodiments of the present invention are not limited to the details of the above embodiments, and various simple modifications can be made to the technical solutions of the embodiments of the present invention within the technical idea of the embodiments of the present invention, and the simple modifications all belong to the protection scope of the embodiments of the present invention.

It should be noted that the various features described in the above embodiments may be combined in any suitable manner without departing from the scope of the invention. In order to avoid unnecessary repetition, the embodiments of the present invention do not describe every possible combination.

Those skilled in the art will understand that all or part of the steps in the methods of the above embodiments may be implemented by a program, which is stored in a storage medium and includes several instructions that enable a single-chip microcomputer, a chip, or a processor to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.

In addition, the various implementation manners of the embodiments of the present invention may be combined in any way; as long as a combination does not depart from the spirit of the embodiments of the present invention, it should likewise be regarded as disclosed by the embodiments of the present invention.

Claims

1. A model construction method for evaluating interaction capacity of artificial intelligence equipment is characterized by comprising the following steps:

2. The model building method for artificial intelligence device interaction capability assessment according to claim 1, wherein step S1) comprises:

3. The method of claim 2, wherein the obtaining of the variation characteristics of the data in each sub data set and the related characteristics of each two sub data sets in the first evaluation data set in step S102) comprises:

4. The model construction method for evaluating the interaction capability of the artificial intelligence device according to claim 2, wherein obtaining the hierarchical relationship of the optimized evaluation index in step S102) comprises:

5. The model building method for artificial intelligence device interaction capability assessment according to claim 1, wherein step S2) comprises:

6. The model construction method for evaluating the interaction capability of the artificial intelligence device according to claim 5, wherein the performing of the parameter estimation on the evaluation index model with the second evaluation data set in combination with the preset model in step S201) includes:

7. The model construction method for evaluating interaction capability of artificial intelligence device according to claim 5, wherein in step S201), after performing parameter estimation on the evaluation index model by using the second evaluation data set in combination with a preset model and before obtaining the weight coefficient of each optimized evaluation index in the optimized evaluation indexes of previous levels in the evaluation index model, the method further comprises:

8. The model building method for evaluating the interaction capability of the artificial intelligence device according to claim 1, wherein the step S2) further comprises, after acquiring the data of the artificial intelligence device about the evaluation index and before performing the parameter estimation about the weight coefficient on the evaluation index model using the data: and processing the data by a data cleaning method and a reliability verification method.

9. A method for artificial intelligence device interaction capability assessment, the method comprising:

10. The method for artificial intelligence device interaction capability assessment according to claim 9, wherein when said assessment index model is a three-level assessment index model, said three-level assessment index model is:

wherein S_X is the evaluation standard score; a_i is the weight coefficient value of the i-th second-level evaluation index within the first-level evaluation index; b_ij is the weight coefficient value of the j-th third-level evaluation index within the i-th second-level evaluation index; F_ij is the current evaluation data of each third-level evaluation index; W_sum is the normalized sum of weights; and M and N are positive integers.

11. A migration method for an artificial intelligence device interaction capability assessment index model is characterized by comprising the following steps:

12. An evaluation data generation method for evaluating interaction capability of artificial intelligence equipment is characterized by comprising the following steps:

13. A method for predictive user experience scoring, the method comprising:

14. The method according to claim 13, wherein step S1) comprises the steps of:

15. A service system for interactive capability assessment of artificial intelligence devices, the service system comprising:

16. An apparatus for artificial intelligence device interaction capability assessment, comprising:

At least one processor;

a memory coupled to the at least one processor;

wherein the memory stores instructions executable by the at least one processor, the at least one processor implementing the method of any one of claims 1 to 14 by executing the instructions stored by the memory.

17. A computer readable storage medium storing computer instructions which, when executed on a computer, cause the computer to perform the method of any one of claims 1 to 14.