CN118051652A - Data processing method and device, storage medium and electronic equipment


Info

Publication number
CN118051652A
Authority
CN
China
Prior art keywords
dialogue, node, graph structure, sample, nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211426856.6A
Other languages
Chinese (zh)
Inventor
曹源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
TCL Technology Group Co Ltd
Original Assignee
TCL Technology Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by TCL Technology Group Co Ltd
Priority to CN202211426856.6A
Publication of CN118051652A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/9032 Query formulation
    • G06F 16/90332 Natural language query formulation or dialogue systems
    • G06F 16/901 Indexing; Data structures therefor; Storage structures
    • G06F 16/9024 Graphs; Linked lists
    • G06F 16/906 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a data processing method and device, a storage medium and an electronic device, wherein the method comprises: obtaining a dialogue sample comprising a plurality of dialogue sentences; determining, according to the dialogue sample, the content feature of each dialogue sentence and the dialogue association features among the dialogue sentences; constructing a target graph structure corresponding to the dialogue sample according to the content features and the dialogue association features; and training a preset model by using the target graph structure to obtain a trained model. Model training can thus combine dialogue content with the dialogue association relationships, which effectively improves the model's ability to recognize dialogue and in turn improves the human-computer interaction effect.

Description

Data processing method and device, storage medium and electronic equipment
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular to a data processing method and device, a storage medium, and an electronic device.
Background
Current mainstream human-computer interaction systems (Google Assistant, Baidu's assistant, and the like) mainly convert a voice dialogue into text, recognize the text, and then perform operations based on the recognition result, such as turning on an air conditioner or playing music, thereby realizing natural-language interaction between machine and human. This is a typical single-modality interaction mode.
However, although multi-modal interaction is the most natural mode of human-computer interaction, existing human-computer interaction systems suffer from low recognition accuracy and poor recognition performance when processing and recognizing multi-modal interaction data.
Disclosure of Invention
The invention provides a data processing method, a data processing device, a storage medium and an electronic device, which can provide a dialogue processing model with high recognition accuracy.
The embodiment of the application provides a data processing method, which comprises the following steps:
obtaining a dialogue sample, wherein the dialogue sample comprises a plurality of dialogue sentences;
determining, according to the dialogue sample, the content feature corresponding to each dialogue sentence and the dialogue association features among the dialogue sentences;
constructing a target graph structure corresponding to the dialogue sample according to the content features and the dialogue association features;
and training a preset model by using the target graph structure to obtain a trained model.
The embodiment of the application also provides a data processing device, which comprises:
an acquisition module, configured to acquire a dialogue sample, wherein the dialogue sample comprises a plurality of dialogue sentences;
a determining module, configured to determine, according to the dialogue sample, the content feature corresponding to each dialogue sentence and the dialogue association features among the dialogue sentences;
a construction module, configured to construct a target graph structure corresponding to the dialogue sample according to the content features and the dialogue association features;
and a training module, configured to train a preset model by using the target graph structure to obtain a trained model.
In some embodiments, the dialogue sample is a video sample, and the determining module is specifically configured to:
extracting a sub-video corresponding to each dialogue sentence from the video sample;
determining text features, audio features and image features corresponding to each dialogue sentence according to the sub-videos;
and splicing at least two of the text features, the audio features and the image features to obtain the content features, wherein each dialogue sentence corresponds to one content feature.
In some embodiments, the construction module is specifically configured to:
taking each content feature as a node;
determining the associated nodes corresponding to each node from all the nodes according to the dialogue association features;
and connecting each node with its corresponding associated node to construct the target graph structure corresponding to the dialogue sample.
In some embodiments, the dialogue association features include a dialogue order and a dialogue logical relationship, and the construction module is specifically configured to:
sorting the nodes according to the dialogue order;
for any two nodes in adjacent sorting positions, taking the node in the latter sorting position as an associated node of the node in the former sorting position;
and, for any two nodes that are not in adjacent sorting positions and whose dialogue logical relationship indicates a response relationship, taking the node corresponding to the responded party as an associated node of the node corresponding to the responding party.
In some embodiments, the construction module is specifically configured to:
connecting each node with its corresponding associated node through a directed connecting edge to obtain a graph structure corresponding to the dialogue sample, wherein each directed connecting edge comprises an arrow end and a non-arrow end, and the arrow end is connected to the associated node;
and determining a target graph structure according to the graph structure.
In some embodiments, the construction module is specifically configured to:
Counting the out-degree value and the in-degree value of each node in the graph structure, wherein the out-degree value is the number of the non-arrow ends connected with the corresponding node, and the in-degree value is the number of the arrow ends connected with the corresponding node;
and updating the graph structure according to the out-degree value and the in-degree value to obtain a target graph structure.
In some embodiments, the construction module is specifically configured to:
when a first difference between the out-degree value and the in-degree value is equal to a first threshold, deleting the corresponding node and the directed connecting edges connected with it;
and when the in-degree value is smaller than a second threshold and the out-degree value is greater than or equal to a third threshold, selecting at least one directed connecting edge connected with the corresponding node for deletion, wherein the first threshold is smaller than the second threshold, and the second threshold is smaller than the third threshold.
In some embodiments, the construction module is specifically configured to:
determining all other nodes connected with the corresponding node as candidate nodes;
calculating a second difference between the in-degree value and the out-degree value of each candidate node;
and deleting the directed connecting edges between the corresponding node and the candidate nodes one by one, in ascending order of the second difference, until the out-degree value of the corresponding node is equal to the third threshold.
In some embodiments, the preset model includes a feature enhancement model and a classification model, and the obtaining module is further configured to obtain a classification label corresponding to each dialogue sentence in the dialogue sample;
and the training module is specifically configured to train the feature enhancement model and the classification model by using the target graph structure and the classification labels.
In some embodiments, the training module is specifically configured to:
inputting the target graph structure into the feature enhancement model to obtain enhancement feature information;
inputting the enhanced feature information into the classification model to obtain a prediction result;
determining an error value between the prediction result and the classification label;
and adjusting the feature enhancement model and the classification model through back-propagation according to the error value.
In some embodiments, the classification model includes an emotion classification model and an intent classification model, and the data processing apparatus further includes an identification module for:
acquiring dialogue data to be recognized;
and recognizing the dialogue data based on the trained model to obtain target emotion information and target intention information.
Embodiments of the present application also provide a computer readable storage medium having stored therein a plurality of instructions adapted to be loaded by a processor to perform any of the data processing methods described above.
The embodiment of the application also provides an electronic device, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor is configured to run the computer program in the memory to execute any of the data processing methods described above.
According to the data processing method and device, the storage medium and the electronic device described above, a dialogue sample comprising a plurality of dialogue sentences is obtained; the content feature corresponding to each dialogue sentence and the dialogue association features among the dialogue sentences are determined according to the sample; a target graph structure corresponding to the dialogue sample is then constructed according to the content features and the dialogue association features; and a preset model is trained by using the target graph structure. Model training thus combines dialogue content with the dialogue association relationships, which effectively improves the model's ability to recognize dialogue and in turn improves the human-computer interaction effect.
Drawings
The technical solution and other advantageous effects of the present application will be made apparent by the following detailed description of the specific embodiments of the present application with reference to the accompanying drawings.
Fig. 1 is a flow chart of a data processing method according to an embodiment of the present application.
Fig. 2 is another flow chart of a data processing method according to an embodiment of the present application.
Fig. 3 is a schematic diagram of a frame of a processing flow of a single video sample according to an embodiment of the present application.
Fig. 4 is a schematic diagram of stitching of individual content features according to an embodiment of the present application.
Fig. 5 is a schematic illustration of the graph structure corresponding to dialogue sentences 1-5 according to an embodiment of the present application.
Fig. 6 is a schematic illustration of a graph structure including 8 nodes according to an embodiment of the present application.
Fig. 7 is a schematic illustration of a target graph structure corresponding to the graph structure in fig. 6.
Fig. 8 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application.
Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Fig. 10 is a schematic diagram of another structure of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to fall within the scope of the application.
The embodiment of the application provides a data processing method, a data processing device, a storage medium and electronic equipment.
Referring to fig. 1, fig. 1 is a schematic flow chart of a data processing method according to an embodiment of the present application, where the data processing method may be applied to electronic devices such as a terminal device or a server, and the specific flow may include steps S101 to S104 as follows:
S101, acquiring a dialogue sample, wherein the dialogue sample comprises a plurality of dialogue sentences.
A dialogue sample is a sample having at least two modality types; it generally involves at least two interactive objects and contains all dialogue sentences expressed from the dialogue start time to the dialogue end time, and the dialogue sentences may be arranged in dialogue order. For example, if two interactive objects A and B hold a dialogue in which A expresses m sentences and B expresses n sentences, the dialogue sentences expressed by object A may be denoted a1 to am and those expressed by object B may be denoted b1 to bn; the sentences a1 to am and b1 to bn can be arranged in dialogue order, and the arranged dialogue sentences serve as the dialogue sample.
The modality types may include a text modality, an audio modality, an image modality, and the like, and the expression form of the dialogue sentences in the sample depends on which modality types are present. For example, if only the text and audio modalities are involved, the dialogue sentences may be expressed in voice form, i.e., each dialogue sentence is audio content; if the text, audio and image modalities are all involved, the dialogue sentences may be expressed in video form, i.e., each dialogue sentence is video content.
S102, determining the content feature corresponding to each dialogue sentence and the dialogue association features among the dialogue sentences according to the dialogue sample.
A content feature is a fusion vector of the feature vectors of several modality types; the fusion may be performed by splicing (concatenation). For each modality type, a corresponding feature vector can be extracted from the dialogue sentence. For a dialogue sentence expressed in audio form, audio feature extraction can be performed directly (yielding the feature vector of the audio modality), and the sentence can also be converted into text by an existing audio-to-text method, after which text feature extraction is performed (yielding the feature vector of the text modality). For a dialogue sentence expressed in video form, the audio part and the image part can first be separated by an existing audio-video separation method; the audio part is then handled as above to obtain the audio and text feature vectors, and image feature extraction is performed directly on the image part (yielding the feature vector of the image modality).
The dialogue association features mainly indicate the association relationships between dialogues, such as temporal and logical association relationships, and may include the dialogue order and the dialogue logical relationship. The dialogue order is usually kept consistent with the dialogue timing, i.e., a dialogue sentence that is earlier in time is also earlier in the dialogue order. The dialogue logical relationship mainly refers to the logical relationship between dialogue sentences and may include a response relationship. For example, suppose the dialogue sample is a dialogue among 3 interactive objects that proceeds in the following temporal order:
Speaker 1: too hot, the air conditioner temperature is adjusted down (dialogue statement 1);
speaker 2: preferably, tuning to 20 degrees (dialogue statement 2);
speaker 1: what is wanted to eat at night (dialogue sentence 3);
speaker 2: i want to eat pizza (dialogue statement 4);
speaker 3: too low a20 degree, i feel cold (dialogue statement 5);
Then it can be considered that dialogue sentences 1 and 2 have a response relationship, as do sentences 2 and 5 and sentences 3 and 4: sentence 2 is a response to sentence 1, sentence 4 is a response to sentence 3, and sentence 5 is a response to sentence 2.
It should be noted that a dialogue sample may already be split by dialogue sentence, for example a split sample formed in real time for each dialogue sentence during the interaction, or it may be a complete, unsplit sample. When the sample is complete and unsplit, the portion corresponding to each dialogue sentence can be determined by manually marking a start timestamp and an end timestamp. The dialogue order and the dialogue logical relationship can likewise be determined by manual labeling; for example, the dialogue order between sentences can be labeled with sequence numbers (sequence number 1, sequence number 2, and so on). If a logical relationship such as a response relationship exists between two dialogue sentences, it can be labeled accordingly, and for a pair of sentences in a response relationship the labels can further indicate which is the responding party and which is the responded party: for dialogue sentences 2 and 5 above, sentence 2 is the responded party and sentence 5 is the responding party.
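For concreteness, a minimal sketch of how such manual annotations might be organized is given below; the field names and layout are illustrative assumptions, not a format prescribed by the application.

```python
# Hypothetical annotation layout for one dialogue sample (field names are
# illustrative only; the application does not fix a concrete format).
dialogue_sample_annotation = {
    "utterances": [
        # order: dialogue-order label; start/end: manually marked timestamps (s)
        {"order": 1, "speaker": "speaker_1", "start": 0.0, "end": 2.4},
        {"order": 2, "speaker": "speaker_2", "start": 2.4, "end": 4.1},
        {"order": 3, "speaker": "speaker_1", "start": 4.1, "end": 5.8},
        {"order": 4, "speaker": "speaker_2", "start": 5.8, "end": 7.0},
        {"order": 5, "speaker": "speaker_3", "start": 7.0, "end": 9.2},
    ],
    # Response relations as (responding sentence, responded-to sentence),
    # matching the example above: 2 answers 1, 4 answers 3, 5 answers 2.
    "responses": [(2, 1), (4, 3), (5, 2)],
}
```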
In some embodiments, the dialogue sample may be a video sample. In this case, referring to fig. 2, step S102 may specifically include the following steps S1021-S1023, where:
S1021, extracting the sub-video corresponding to each dialogue sentence from the video sample.
Here the video sample is a complete, unsplit sample, and the sub-video corresponding to each dialogue sentence can be extracted according to the manually marked start timestamp and end timestamp in the video sample.
S1022, determining text features, audio features and image features corresponding to each dialogue sentence according to the sub-videos.
The feature vectors of the different modality types may be extracted with different tools. For example, the audio part and the image part may first be extracted from the sub-video; audio features (vectors) can then be extracted from the audio part by an audio vectorization tool such as Mockingjay. The audio part may be converted to a text part by an audio-to-text tool, and text features (vectors) extracted from the text part by a text vectorization tool such as word2vec. Image features (vectors) may be extracted from the image part by an image vectorization tool such as VLAD.
It should be noted that the text features are mainly used to analyze the actual content of a dialogue sentence, while the audio and image features are mainly used to analyze the interactive object, for example its tone and mood through the audio features and its facial expression through the image features. Because the image part corresponding to a dialogue sentence generally consists of many consecutive image frames rather than a single frame, the image part may be sampled before image feature extraction in order to reduce subsequent computation, and the image features are then extracted from the sampled frames. To ensure that sampling does not impair the analysis of the interactive object (e.g., of facial expressions), the sampling frequency should not be too low; for example, 1 image frame may be kept out of every 4. After the image feature of each sampled frame is obtained, the features can be spliced into one vector in temporal order, giving the image features for all sampled frames.
S1023, splicing at least two of the text features, the audio features and the image features to obtain the content features, wherein each dialogue sentence corresponds to one content feature.
The feature vectors of any two modality types may be spliced, or the feature vectors of all modality types may be spliced. The splicing rule can be set manually; for example, the text features, audio features and image features can be spliced in the order text, audio, image, so that the resulting content feature fuses the feature vectors of the various modality types.
For example, referring to fig. 3, which shows the processing flow of a single video sample: the video sample is split into several sub-videos; the audio part and the image part of each sub-video are extracted, and the audio part is further converted into a text part; vectorization tools then turn the audio, text and image parts into feature vectors, which are spliced into the content feature of each sub-video. For a single content feature, see fig. 4: the feature shown may correspond to dialogue sentence 1 ("It's too hot, turn the air conditioner down") and is formed by splicing one text vector, one audio vector and n image vectors in turn, where n is the number of sampled image frames that took part in image feature extraction. The dimensions of the text vector, the audio vector and each single-frame image vector may be equal or different; for example, each may be 200-dimensional.
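The following Python sketch illustrates the splicing under stated assumptions: the audio_to_text and embed_* helpers are placeholder stand-ins for real tools (an ASR system, word2vec, Mockingjay, VLAD), and the 200-dimensional vectors and 1-in-4 frame sampling follow the examples in the text; none of these names is an API defined by the application.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder stand-ins for the real vectorization tools named above.
def audio_to_text(audio): return "placeholder transcript"
def embed_text(text):   return rng.standard_normal(200)   # text vector
def embed_audio(audio): return rng.standard_normal(200)   # audio vector
def embed_frame(frame): return rng.standard_normal(200)   # per-frame image vector

def extract_content_feature(audio, frames, frame_step=4):
    """Build one content feature for one dialogue sentence by splicing
    text | audio | image vectors, as described for fig. 4."""
    text_vec = embed_text(audio_to_text(audio))
    audio_vec = embed_audio(audio)
    sampled = frames[::frame_step]            # keep 1 frame out of every 4
    image_vec = np.concatenate([embed_frame(f) for f in sampled])
    return np.concatenate([text_vec, audio_vec, image_vec])

# A sentence whose image part has 8 frames yields 2 sampled frames,
# i.e. a 200 + 200 + 2*200 = 800-dimensional content feature.
feature = extract_content_feature(audio=None, frames=list(range(8)))
print(feature.shape)   # (800,)
```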
S103, constructing a target graph structure corresponding to the dialogue sample according to the content characteristics and the dialogue association characteristics.
With continued reference to fig. 3, after determining the content features corresponding to each dialogue sentence, the content features may be mapped to obtain a target graph structure.
In some embodiments, referring to fig. 2, the step S103 may specifically include:
S1031, taking each content feature as a node.
Each node is thus a multidimensional spliced vector whose total dimension is the sum of the dimensions of the component feature vectors, for example the sum of the dimensions of a dialogue sentence's text vector, audio vector and image vectors.
S1032, determining the associated nodes corresponding to each node from all the nodes according to the dialogue association features.
When constructing the graph structure, nodes having an association relationship are connected. Whether an association exists between nodes is determined not only from the dialogue timing relationship but also from the dialogue logical relationship, so that the associations between nodes are considered from multiple angles, improving the accuracy and comprehensiveness of graph construction. Two nodes adjacent in the dialogue order can generally be considered associated, and two nodes having a dialogue logical relationship are associated as well. Since any two nodes need to be connected at most once, there is no need to check for a dialogue logical association between two nodes that already have a dialogue timing association.
In some embodiments, when the dialogue association feature includes a dialogue order and a dialogue logical relationship, the step S1032 may specifically include:
sorting the nodes according to the dialogue order;
for any two nodes in adjacent sorting positions, taking the node in the latter sorting position as an associated node of the node in the former sorting position;
and, for any two nodes that are not in adjacent sorting positions and whose dialogue logical relationship indicates a response relationship, taking the node corresponding to the responded party as an associated node of the node corresponding to the responding party.
For dialogue sentences 1-5 above, let the corresponding nodes be nodes 1-5; the result of sorting the nodes in dialogue order is shown in fig. 5. Based on the dialogue order, node 2 is an associated node of node 1, node 3 of node 2, node 4 of node 3, and node 5 of node 4; that is, any two nodes adjacent in the dialogue order have an association relationship. Node 1 may also be regarded as an associated node of node 5 (the head and tail nodes are treated as adjacent).
Meanwhile, for nodes whose sorting positions are not adjacent, it must be further analyzed whether a dialogue logical relationship exists, e.g., between nodes 1 and 3 or between nodes 2 and 4. As noted above, since dialogue sentence 5 is a reply to dialogue sentence 2, a response relationship (a dialogue logical relationship) exists between nodes 5 and 2, and node 2 can be regarded as an associated node of node 5.
S1033, connecting each node with the corresponding associated node to construct a target graph structure corresponding to the dialogue sample.
The graph structure formed by directly connecting each node to its associated nodes through connecting edges can itself serve as the target graph structure; in that case the connecting edges may be line segments without arrows, and the resulting graph is undirected. However, if the graph structure is relatively large, for example when many dialogue sentences produce many nodes, the computational cost of model training is high and the training time long. The graph structure can therefore be simplified, and the simplified graph used as the target graph structure, which helps reduce subsequent computing resources and shorten the training time.
For example, in some embodiments, the step S1033 may specifically include:
connecting each node with its corresponding associated node through a directed connecting edge to obtain a graph structure corresponding to the dialogue sample, wherein each directed connecting edge comprises an arrow end and a non-arrow end, and the arrow end is connected to the associated node;
and determining a target graph structure according to the graph structure.
A directed connecting edge is a connecting edge with a direction; it can be drawn as a line segment with an arrow, the arrow indicating the direction of the edge, and the resulting graph structure is a directed graph. With continued reference to fig. 5, nodes 1 and 2, nodes 2 and 3, nodes 3 and 4, nodes 4 and 5, and nodes 5 and 1 are all connected by directed connecting edges, and nodes 2 and 5, which have a dialogue logical association, are connected by a directed connecting edge as well.
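To make the construction concrete, the following Python sketch derives the directed edge list for dialogue sentences 1-5 under the rules above; the function and its signature are illustrative assumptions, not part of the application.

```python
def build_directed_edges(num_nodes, responses, close_cycle=True):
    """Directed edges of the dialogue graph (a sketch of steps S1032/S1033).

    Nodes are numbered 1..num_nodes in dialogue order.  An edge (u, v) runs
    from node u to its associated node v, i.e. the arrow end sits at v.
    `responses` lists (responding, responded-to) pairs from the manual
    labels; only non-adjacent pairs add edges, since adjacent nodes are
    already connected by the ordering rule.
    """
    edges = set()
    # Adjacent sorting positions: the later node is the associated node
    # of the earlier one.
    for u in range(1, num_nodes):
        edges.add((u, u + 1))
    if close_cycle:                      # optional head-tail adjacency
        edges.add((num_nodes, 1))
    # Non-adjacent response relations: the responded-to node is the
    # associated node of the responding node.
    for responding, responded in responses:
        if abs(responding - responded) > 1:
            edges.add((responding, responded))
    return sorted(edges)

# Dialogue sentences 1-5 above; sentence 5 answers sentence 2.
print(build_directed_edges(5, [(2, 1), (4, 3), (5, 2)]))
# -> [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1), (5, 2)]
```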
When simplifying the graph structure, some nodes and/or directed connecting edges can be selectively deleted according to the directed connecting edges attached to each node, so as to obtain the target graph structure.
For example, in some embodiments, the step of determining the target graph structure from the graph structure may specifically include:
counting the out-degree value and the in-degree value of each node in the graph structure, wherein the out-degree value is the number of non-arrow ends connected with the corresponding node, and the in-degree value is the number of arrow ends connected with the corresponding node;
and updating the graph structure according to the out-degree values and the in-degree values to obtain the target graph structure.
For example, in fig. 5 above, the out-degree value and the in-degree value of each of nodes 1, 3 and 4 are 1; node 2 has an out-degree value of 1 and an in-degree value of 2; and node 5 has an out-degree value of 2 and an in-degree value of 1.
The larger a node's out-degree value, the more times the corresponding dialogue sentence serves as a reply to other dialogue sentences, i.e., the more often the corresponding interactive object responds to others; such an object tends not to be dominant in the interaction, so a larger out-degree value implies lower influence of the corresponding interactive object. Conversely, the larger a node's in-degree value, the more times other dialogue sentences serve as replies to this one, i.e., the more often other objects respond to this interactive object; such an object tends to be dominant, so a larger in-degree value implies higher influence. Which nodes and/or directed connecting edges to delete can therefore be decided based on the influence of the interactive objects.
For example, in some embodiments, the step of updating the graph structure according to the out-degree value and the in-degree value may specifically include:
when a first difference between the out-degree value and the in-degree value is equal to a first threshold, deleting the corresponding node and the directed connecting edges connected with it;
and when the in-degree value is smaller than a second threshold and the out-degree value is greater than or equal to a third threshold, selecting at least one directed connecting edge connected with the corresponding node for deletion, wherein the first threshold is smaller than the second threshold, and the second threshold is smaller than the third threshold.
The first, second and third thresholds are all set as required; for example, the first threshold may be 0, the second 3 and the third 5. When the difference between a node's out-degree and in-degree values equals 0, the influence of the corresponding interactive object in the dialogue can be considered low and the node can be deleted. When a node's in-degree value is smaller than 3 (say 1) while its out-degree value is greater than or equal to 5 (say 6), the corresponding interactive object responds to others with high probability, and such responses are often repetitive or without practical meaning ("okay", "mm-hmm", and the like); some of the directed connecting edges attached to the node can then be deleted, for example by randomly selecting edges that affect the out-degree value, until the out-degree value falls to the third threshold 5.
For example, fig. 6 shows a graph structure containing 8 nodes. Node 3 has an out-degree value of 1 and an in-degree value of 1, so the first difference between them equals 0 (the first threshold) and node 3 should be deleted. Node 8 has an out-degree value of 6 and an in-degree value of 0, so its in-degree value is less than 3 (the second threshold) while its out-degree value is greater than 5 (the third threshold); 1 edge affecting the out-degree value should therefore be selected from the directed connecting edges attached to node 8 and deleted, for example the edge between node 8 and node 4, after which the out-degree value of node 8 becomes 5. The resulting target graph structure is shown in fig. 7.
In practice, the deletion of nodes and directed connecting edges can be triggered by two judgment steps: judging whether the first difference between the out-degree and in-degree values equals the first threshold, and judging whether the in-degree value is smaller than the second threshold while the out-degree value is greater than or equal to the third threshold. The two judgments may be executed in either order, as required; no limitation is made here. For a node that meets neither deletion condition, for example one whose in-degree value is greater than the second threshold, the corresponding interactive object can be considered to have a certain influence in the interaction, and the node is kept rather than deleted.
Of course, when deleting directed connecting edges, a suitable edge can also be chosen by considering the connection situation of the other associated nodes instead of selecting at random as above. For example, in some embodiments, the step of selecting at least one directed connecting edge connected with the corresponding node for deletion may specifically include:
determining all other nodes connected with the corresponding node as candidate nodes;
calculating a second difference between the in-degree value and the out-degree value of each candidate node;
and deleting the directed connecting edges between the corresponding node and the candidate nodes one by one, in ascending order of the second difference, until the out-degree value of the corresponding node is equal to the third threshold.
For example, for node 8 in fig. 6, the candidate nodes are nodes 1, 2, 4, 5, 6 and 7, whose second differences are 1, 1, 0, 2, 1 and 1 respectively. Since the out-degree value of node 8 exceeds the third threshold by only 1, a single directed connecting edge should be deleted, namely the edge between node 8 and node 4, node 4 being the candidate with the smallest second difference. If several candidate nodes share the smallest second difference, the directed connecting edge of any one of them may be deleted first.
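The two deletion rules can be sketched together as follows. This is a minimal Python illustration assuming thresholds 0/3/5 as in the example; it uses the second-difference rule (rather than random selection) to pick edges, and the helper names and one-pass evaluation order are assumptions.

```python
from collections import Counter

def degree_counts(edges):
    """Out-degree and in-degree per node; an edge (u, v) has its
    non-arrow end at u and its arrow end at v."""
    out_deg, in_deg = Counter(), Counter()
    for u, v in edges:
        out_deg[u] += 1
        in_deg[v] += 1
    return out_deg, in_deg

def simplify(edges, t1=0, t2=3, t3=5):
    """One pass of the two deletion rules (thresholds t1 < t2 < t3)."""
    edges = set(edges)
    out_deg, in_deg = degree_counts(edges)
    nodes = set(out_deg) | set(in_deg)

    # Rule 1: out-degree - in-degree == t1 -> drop node and all its edges.
    doomed = {n for n in nodes if out_deg[n] - in_deg[n] == t1}
    edges = {(u, v) for u, v in edges if u not in doomed and v not in doomed}
    out_deg, in_deg = degree_counts(edges)

    # Rule 2: in-degree < t2 and out-degree above t3 -> trim outgoing edges,
    # removing first the edge whose target has the smallest second
    # difference (target in-degree minus target out-degree).
    for n in list(out_deg):
        while in_deg[n] < t2 and out_deg[n] > t3:
            targets = [v for u, v in edges if u == n]
            victim = min(targets, key=lambda v: in_deg[v] - out_deg[v])
            edges.remove((n, victim))
            out_deg, in_deg = degree_counts(edges)
    return edges

# A hub node 0 with out-degree 6 and in-degree 0 (rule 2 fires: 0 < 3, 6 > 5).
g = {(0, v) for v in range(1, 7)} | {(1, 2), (2, 3), (3, 1),
                                     (4, 5), (5, 4), (6, 1), (6, 2)}
print(sorted(simplify(g)))   # edge (0, 6) is gone: node 6 has the
                             # smallest second difference among the targets
```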
S104, training the preset model by using the target graph structure to obtain a trained model.
The preset model comprises a feature enhancement model and a classification model. The feature enhancement model may be a graph neural network model for processing the graph structure and is mainly used for graph convolution operations; the classification model can be determined by the actual scene task, e.g., the emotion classification model and intention classification model in fig. 3.
It is easy to understand that for different classification models, corresponding classification labels need to be set in advance, and model training is performed by combining the classification labels with the training samples. That is, before step S104, the data processing method may further include: obtaining a classification label corresponding to each dialogue sentence in the dialogue sample.
At this time, step S104 may specifically include: training the feature enhancement model and the classification model by using the target graph structure and the classification labels.
In some embodiments, the step of training the feature enhancement model and the classification model using the target graph structure and the classification label may specifically include:
Inputting the target graph structure into a feature enhancement model to obtain enhancement feature information;
inputting the enhanced feature information into the classification model to obtain a prediction result;
determining an error value between the prediction result and the classification label;
and adjusting the feature enhancement model and the classification model through back-propagation according to the error value.
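A minimal PyTorch sketch of one training step over these four sub-steps follows; the two nn.Linear modules are placeholder stand-ins for the actual feature enhancement (graph) model and classification model, whose internals are not fixed here, and the dimensions and label count are assumptions.

```python
import torch
import torch.nn as nn

enhancer = nn.Linear(800, 128)      # placeholder feature enhancement model
classifier = nn.Linear(128, 6)      # placeholder classifier, e.g. 6 emotion labels
optimizer = torch.optim.Adam(
    list(enhancer.parameters()) + list(classifier.parameters()), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

node_features = torch.randn(7, 800)   # 7 nodes of the target graph structure
labels = torch.randint(0, 6, (7,))    # one classification label per sentence

optimizer.zero_grad()
enhanced = enhancer(node_features)    # step 1: enhanced feature information
logits = classifier(enhanced)         # step 2: prediction result
loss = loss_fn(logits, labels)        # step 3: error value vs. the labels
loss.backward()                       # step 4: back-propagate through both models
optimizer.step()
```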
The feature enhancement model generally converts the graph structure into an adjacency matrix and performs the graph convolution operation on that matrix to obtain the enhanced feature information. For example, for the target graph structure shown in fig. 7, the converted adjacency matrix is shown in Table 1 below:
      N1  N2  N4  N5  N6  N7  N8
N1     0   1   0   0   0   1   1
N2     1   0   1   0   0   0   1
N4     0   1   0   1   0   0   0
N5     0   0   1   0   1   1   1
N6     0   0   0   1   0   1   1
N7     1   0   0   1   1   0   1
N8     1   1   0   1   1   1   0

Table 1
Here N denotes a node: N1 is node 1, N2 is node 2, and so on. If an association exists between two nodes, for example the directed connecting edge between nodes 1 and 2 in fig. 7, the value at the corresponding row-column intersections is set to 1; if no association exists, for example there is no directed connecting edge between nodes 2 and 5 in fig. 7, the value is set to 0. A single node has no association relationship with itself, so the diagonal entries are also 0. Assigning a value to every row-column intersection in this way yields a 7x7 matrix of 0s and 1s, namely the adjacency matrix. The graph neural network model then performs the graph convolution on the adjacency matrix, which can be expressed as H^{L+1} = α(Â H^L W^L), where α is an activation function, Â is the adjacency matrix, L is the layer index of the graph neural network, W^L is the weight matrix of layer L, and H^L is the enhanced feature information at layer L; the feature information of the first layer, H^0, is the initialization vector.
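The following NumPy sketch rebuilds the adjacency matrix of Table 1 and applies one such graph-convolution layer. Using ReLU for α, leaving Â un-normalized, and the random H^0 and W^0 are illustrative assumptions (in practice Â often additionally includes self-loops and degree normalization).

```python
import numpy as np

# Adjacency matrix of the target graph structure, rebuilt from Table 1
# (row/column order N1, N2, N4, N5, N6, N7, N8).
A = np.array([
    [0, 1, 0, 0, 0, 1, 1],
    [1, 0, 1, 0, 0, 0, 1],
    [0, 1, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 1, 1, 1],
    [0, 0, 0, 1, 0, 1, 1],
    [1, 0, 0, 1, 1, 0, 1],
    [1, 1, 0, 1, 1, 1, 0],
], dtype=float)

def gcn_layer(A_hat, H, W):
    """One graph-convolution step H^{L+1} = alpha(A_hat @ H^L @ W^L),
    with ReLU as the activation alpha (an assumption)."""
    return np.maximum(A_hat @ H @ W, 0.0)

rng = np.random.default_rng(0)
H0 = rng.standard_normal((7, 16))    # initial node feature matrix H^0
W0 = rng.standard_normal((16, 16))   # layer-0 weight matrix W^0
H1 = gcn_layer(A, H0, W0)
print(H1.shape)                      # (7, 16): enhanced feature information
```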
It should be noted that if the graph structure is not simplified, i.e., no nodes or directed connecting edges are deleted and the graph is converted directly into an adjacency matrix, the matrix used in the operation will be relatively large. For the graph structure in fig. 6, the adjacency matrix obtained in the same way is shown in Table 2 below:
      N1  N2  N3  N4  N5  N6  N7  N8
N1     0   1   0   0   0   0   1   1
N2     1   0   1   1   0   0   1   1
N3     0   1   0   1   0   0   0   0
N4     0   1   1   0   0   0   0   1
N5     0   0   0   0   0   1   1   1
N6     0   0   0   0   1   0   1   1
N7     1   1   0   0   1   1   0   1
N8     1   1   0   1   1   1   1   0

Table 2
As can be seen, without the graph-structure simplification the adjacency matrix is an 8x8 matrix of 0s and 1s, versus the 7x7 matrix of Table 1. The simplification therefore visibly reduces the scale of the adjacency matrix, which lowers the computational cost of model training, effectively shortens both the training time and the subsequent model inference time, and improves the reaction speed of the human-computer interaction system.
In some embodiments, the classification model may include an emotion classification model and an intention classification model, with corresponding emotion and intention classification labels; the emotion labels may include happiness, anger, sadness, fear, aversion, neutrality and the like, and the intention labels may include temperature adjustment, music playing, light adjustment and the like.
In addition, after the preset model has been trained on the dialogue samples and classification labels, the trained model can be used for recognition in a human-computer interaction system. That is, after step S104, the data processing method may further include:
acquiring dialogue data to be recognized;
and recognizing the dialogue data based on the trained model to obtain target emotion information and target intention information.
The expression form of the dialogue data to be recognized is consistent with that of the training samples: when the training samples are video samples the dialogue data is video data, and when they are voice samples the dialogue data is voice data. Recognition with the trained model is generally real-time; once the user and the machine complete one exchange, the human-computer interaction system can collect the corresponding dialogue data and recognize it immediately, without waiting for the whole interaction to finish, thereby improving the system's reaction speed and the interaction experience as much as possible. After the recognition result is obtained, the system can further perform operations based on it, such as playing music, turning on an air conditioner or turning off a light.
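A short sketch of this recognition flow follows, with hypothetical placeholder modules standing in for the trained feature enhancement model and the two classification heads; the shapes and label counts are assumptions for illustration.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def recognize(node_features, enhancer, emotion_head, intent_head):
    """Run the trained models on newly collected dialogue data (a sketch)."""
    h = enhancer(node_features)                 # enhanced feature information
    emotion = emotion_head(h).argmax(dim=-1)    # target emotion information
    intent = intent_head(h).argmax(dim=-1)      # target intention information
    return emotion, intent

# One 800-dim content feature for a newly acquired utterance.
emotion, intent = recognize(
    torch.randn(1, 800),
    enhancer=nn.Linear(800, 128),      # placeholder trained enhancer
    emotion_head=nn.Linear(128, 6),    # e.g. 6 emotion labels
    intent_head=nn.Linear(128, 3),     # e.g. 3 intention labels
)
```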
As can be seen from the foregoing, the data processing method provided in this embodiment obtains a dialogue sample comprising a plurality of dialogue sentences; determines, according to the sample, the content feature corresponding to each dialogue sentence and the dialogue association features among the sentences; constructs a target graph structure corresponding to the sample according to the content features and dialogue association features; and trains a preset model by using the target graph structure. Model training thus combines dialogue content with the dialogue association relationships, effectively improving the model's ability to recognize dialogue and in turn the human-computer interaction effect.
Based on the method described in the foregoing embodiments, this embodiment provides a further description from the perspective of a data processing apparatus. The apparatus may be implemented as a separate entity and applied to an electronic device such as a server or a terminal device, where the terminal device may be a mobile phone, a tablet computer, an intelligent robot or the like, and the server may be one that provides a human-computer interaction function.
Referring to fig. 8, fig. 8 specifically illustrates a data processing apparatus according to an embodiment of the present application, where the data processing apparatus may include: acquisition module 10, determination module 20, construction module 30, and training module 40, wherein:
(1) Acquisition Module 10
The obtaining module 10 is configured to obtain a dialogue sample, where the dialogue sample includes a plurality of dialogue sentences.
(2) Determination module 20
A determining module 20, configured to determine, according to the dialogue sample, the content feature corresponding to each dialogue sentence and the dialogue association features among the dialogue sentences.
In some embodiments, the dialogue sample is a video sample, and the determining module 20 is specifically configured to:
extracting a sub-video corresponding to each dialogue sentence from the video sample;
determining text features, audio features and image features corresponding to each dialogue sentence according to the sub-videos;
and splicing at least two of the text features, the audio features and the image features to obtain the content features, wherein each dialogue sentence corresponds to one content feature.
(3) Build module 30
The construction module 30 is configured to construct a target graph structure corresponding to the dialogue sample according to the content feature and the dialogue association feature between the plurality of dialogue sentences.
In some embodiments, the construction module 30 is specifically configured to:
taking each content feature as a node;
determining the associated nodes corresponding to each node from all the nodes according to the dialogue association features;
and connecting each node with its corresponding associated node to construct the target graph structure corresponding to the dialogue sample.
In some embodiments, the dialogue association features include a dialogue order and a dialogue logical relationship, and the construction module 30 is specifically configured to:
sorting the nodes according to the dialogue order;
for any two nodes in adjacent sorting positions, taking the node in the latter sorting position as an associated node of the node in the former sorting position;
and, for any two nodes that are not in adjacent sorting positions and whose dialogue logical relationship indicates a response relationship, taking the node corresponding to the responded party as an associated node of the node corresponding to the responding party.
In some embodiments, the construction module 30 is specifically configured to:
connecting each node with its corresponding associated node through a directed connecting edge to obtain a graph structure corresponding to the dialogue sample, wherein each directed connecting edge comprises an arrow end and a non-arrow end, and the arrow end is connected to the associated node;
and determining a target graph structure according to the graph structure.
In some embodiments, the construction module 30 is specifically configured to:
counting the out-degree value and the in-degree value of each node in the graph structure, wherein the out-degree value is the number of non-arrow ends connected with the corresponding node, and the in-degree value is the number of arrow ends connected with the corresponding node;
and updating the graph structure according to the out-degree values and the in-degree values to obtain the target graph structure.
In some embodiments, the construction module 30 is specifically configured to:
when a first difference between the out-degree value and the in-degree value is equal to a first threshold, deleting the corresponding node and the directed connecting edges connected with it;
and when the in-degree value is smaller than a second threshold and the out-degree value is greater than or equal to a third threshold, selecting at least one directed connecting edge connected with the corresponding node for deletion, wherein the first threshold is smaller than the second threshold, and the second threshold is smaller than the third threshold.
In some embodiments, the construction module 30 is specifically configured to:
determining all other nodes connected with the corresponding node as candidate nodes;
calculating a second difference between the in-degree value and the out-degree value of each candidate node;
and deleting the directed connecting edges between the corresponding node and the candidate nodes one by one, in ascending order of the second difference, until the out-degree value of the corresponding node is equal to the third threshold.
(4) Training module 40
The training module 40 is configured to train the preset model by using the target graph structure, so as to obtain a trained model.
In some embodiments, the preset model includes a feature enhancement model and a classification model, and the obtaining module 10 is further configured to obtain a classification label corresponding to each dialogue sentence in the dialogue sample;
and the training module 40 is specifically configured to train the feature enhancement model and the classification model by using the target graph structure and the classification labels.
In some embodiments, the training module 40 is specifically configured to:
inputting the target graph structure into the feature enhancement model to obtain enhancement feature information;
inputting the enhanced feature information into the classification model to obtain a prediction result;
determining an error value between the prediction result and the classification label;
and adjusting the feature enhancement model and the classification model through back-propagation according to the error value.
In some implementations, the classification model includes an emotion classification model and an intent classification model, and the data processing apparatus further includes an identification module for:
acquiring dialogue data to be recognized;
and recognizing the dialogue data based on the trained model to obtain target emotion information and target intention information.
In implementation, each module may be implemented as an independent entity, or the modules may be combined arbitrarily and implemented as one or several entities; for the implementation of each module, see the foregoing method embodiment, which is not repeated here.
In addition, the embodiment of the application also provides an electronic device, which may be a device such as a smart phone or a tablet computer. As shown in fig. 9, the electronic device 200 includes a processor 201 and a memory 202. The processor 201 is electrically connected to the memory 202.
The processor 201 is the control center of the electronic device 200. It connects the various parts of the device through various interfaces and lines, and performs the device's functions and processes its data by running or loading application programs stored in the memory 202 and calling data stored in the memory 202, thereby monitoring the electronic device as a whole.
In this embodiment, the processor 201 of the electronic device 200 loads the instructions corresponding to the processes of one or more application programs into the memory 202 and runs the application programs stored in the memory 202, thereby implementing various functions:
obtaining a dialogue sample, wherein the dialogue sample comprises a plurality of dialogue sentences;
determining content characteristics corresponding to each dialogue sentence and dialogue association characteristics among a plurality of dialogue sentences according to the dialogue sample;
constructing a target graph structure corresponding to the dialogue sample according to the content characteristics and the dialogue association characteristics;
and training the preset model by using the target graph structure to obtain a trained model.
Fig. 10 shows a specific block diagram of an electronic device according to an embodiment of the present invention, which may be used to implement the data processing method provided in the above embodiment. The electronic device may comprise a smart phone or a server.
The electronic device may include a processor 301 having one or more processing cores, a memory 302 having one or more computer-readable storage media, a radio frequency (RF) circuit 303, a power supply 304, an input unit 305, a display unit 306, and other components. Those skilled in the art will appreciate that the electronic device structure shown in the figure does not limit the electronic device, which may include more or fewer components than shown, combine certain components, or arrange the components differently. Wherein:
The processor 301 is the control center of the electronic device. It connects the various parts of the entire electronic device using various interfaces and lines, and performs the various functions of the electronic device and processes its data by running or executing software programs and/or modules stored in the memory 302 and calling data stored in the memory 302, thereby monitoring the electronic device as a whole. Optionally, the processor may include one or more processing cores. Preferably, the processor may integrate an application processor and a modem processor, wherein the application processor mainly handles the operating system, user interfaces, applications, and the like, and the modem processor mainly handles wireless communication. It will be appreciated that the modem processor may alternatively not be integrated into the processor.
The memory 302 may be used to store software programs (computer programs) and modules, and the processor 301 executes various functional applications and performs data processing by running the software programs and modules stored in the memory 302. The memory 302 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function (such as a sound playing function or an image playing function), and the like; the data storage area may store data created according to the use of the electronic device, and the like. In addition, the memory 302 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. Accordingly, the memory 302 may further include a memory controller to provide the processor 301 with access to the memory 302.
The RF circuit 303 may be used for receiving and transmitting signals during information transmission and reception. Specifically, after receiving downlink information from a base station, the RF circuit hands the information to the one or more processors 301 for processing; it also transmits uplink data to the base station. Typically, the RF circuit 303 includes, but is not limited to, an antenna, at least one amplifier, a tuner, one or more oscillators, a subscriber identity module (SIM) card, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 303 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile Communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Message Service (SMS), and the like.
The electronic device further includes a power supply 304 (such as a battery) for supplying power to the components. Preferably, the power supply 304 is logically connected to the processor 301 through a power management system, so that functions such as charging, discharging, and power consumption management are performed through the power management system. The power supply 304 may further include one or more of a direct-current or alternating-current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and other arbitrary components.
The electronic device may further include an input unit 305, and the input unit 305 may be used for receiving input digital or character information and generating keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control. In particular, in one specific embodiment, the input unit 305 may include a touch-sensitive surface and other input devices. The touch-sensitive surface, also referred to as a touch display screen or a touch pad, may collect touch operations performed by a user on or near it (such as operations performed by the user on or near the touch-sensitive surface using any suitable object or accessory, such as a finger or a stylus), and drive the corresponding connection device according to a preset program. Optionally, the touch-sensitive surface may include two parts: a touch detection device and a touch controller. The touch detection device detects the touch position of the user, detects the signal generated by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, sends the coordinates to the processor 301, and can receive and execute commands sent by the processor 301. In addition, the touch-sensitive surface may be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave types. Besides the touch-sensitive surface, the input unit 305 may include other input devices, which may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys and switch keys), a trackball, a mouse, and a joystick.
The electronic device may further include a display unit 306, and the display unit 306 may be used to display information input by the user or provided to the user, as well as various graphical user interfaces of the electronic device, which may be composed of graphics, text, icons, videos, and any combination thereof. The display unit 306 includes a plurality of hardware display processing units, a video frame processing module, a display screen, and the like, wherein the plurality of hardware display processing units and the video frame processing module may be integrated in one processing chip. The display screen may include a display panel, and optionally, the display panel may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED), or the like. Further, the touch-sensitive surface may cover the display panel; when a touch operation on or near the touch-sensitive surface is detected, the operation is passed to the processor 301 to determine the type of the touch event, and the processor 301 then provides a corresponding visual output on the display panel according to the type of the touch event. Although in the figure the touch-sensitive surface and the display panel implement the input and output functions as two separate components, in some embodiments the touch-sensitive surface may be integrated with the display panel to implement the input and output functions.
Although not shown, the electronic device may further include a camera, a Bluetooth module, and the like, which are not described here. The electronic device further includes a first splicing module, which includes a signal processing module, a plurality of image processing modules connected to the signal processing module, and an image splicing module connected to the image processing modules, and each image processing module is connected to a corresponding first display pixel interface. In particular, in this embodiment, the processor 301 in the electronic device loads executable files corresponding to the processes of one or more application programs into the memory 302 according to the following instructions, and executes the application programs stored in the memory 302, thereby implementing the following functions:
obtaining a dialogue sample, wherein the dialogue sample comprises a plurality of dialogue sentences;
determining content characteristics corresponding to each dialogue sentence and dialogue association characteristics among the plurality of dialogue sentences according to the dialogue sample;
constructing a target graph structure corresponding to the dialogue sample according to the content characteristics and the dialogue association characteristics;
and training the preset model by using the target graph structure to obtain a trained model.
The electronic device can implement the steps in any embodiment of the data processing method provided by the embodiments of the present application, and can therefore achieve the beneficial effects that any of those data processing methods can achieve; for details, see the foregoing embodiments, which are not repeated here.
Those of ordinary skill in the art will appreciate that all or part of the steps of the various methods in the above embodiments may be completed by instructions (computer programs), or by instructions (computer programs) controlling associated hardware, and the instructions may be stored in a computer-readable storage medium and loaded and executed by a processor. To this end, an embodiment of the present invention provides a computer-readable storage medium in which a computer program is stored, where the computer program can be loaded by a processor to perform the steps in any embodiment of the data processing method provided by the embodiments of the present invention.
Wherein the computer-readable storage medium may include: a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc, and the like.
Since the computer program stored in the storage medium can execute the steps in any embodiment of the data processing method provided by the embodiments of the present invention, it can achieve the beneficial effects that any of those data processing methods can achieve; for details, see the foregoing embodiments, which are not repeated here.
The data processing method, apparatus, electronic device, and storage medium provided by the embodiments of the present application have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present application, and the description of the above embodiments is only intended to help understand the method and its core idea. Meanwhile, those skilled in the art may make changes to the specific implementations and the application scope according to the ideas of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (14)

1. A data processing method, comprising:
obtaining a dialogue sample, wherein the dialogue sample comprises a plurality of dialogue sentences;
determining content features of each dialogue sentence and dialogue association features among the plurality of dialogue sentences according to the dialogue sample;
constructing a target graph structure corresponding to the dialogue sample according to the content features and the dialogue association features;
and training a preset model by using the target graph structure to obtain a trained model.
2. The data processing method according to claim 1, wherein the dialogue sample is a video sample, and the determining the content features corresponding to each dialogue sentence according to the dialogue sample comprises:
extracting a sub-video corresponding to each dialogue sentence from the video sample;
determining text features, audio features and image features corresponding to the dialogue sentences according to the sub-videos;
and splicing at least two of the text features, the audio features and the image features to obtain the content features, wherein each dialogue sentence corresponds to one content feature.
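An illustrative Python sketch of the splicing step above; the tensor library and the per-modality dimensions are assumptions for illustration, not details fixed by the claim:

    import torch

    def content_feature(text_feat, audio_feat, image_feat):
        # Splice the modal features of one dialogue sentence into a single
        # content feature by concatenation along the feature dimension.
        return torch.cat([text_feat, audio_feat, image_feat], dim=-1)

    # One dialogue sentence: 64-dim text, 32-dim audio, 32-dim image features.
    feat = content_feature(torch.randn(64), torch.randn(32), torch.randn(32))
    assert feat.shape == (128,)  # one 128-dim content feature per sentence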
3. The data processing method according to claim 1, wherein the constructing the target graph structure corresponding to the dialogue sample according to the content features and the dialogue association features comprises:
taking each content feature as a node;
determining, from all the nodes, the associated node corresponding to each node according to the dialogue association features;
and connecting each node with the corresponding associated node to construct the target graph structure corresponding to the dialogue sample.
4. The data processing method according to claim 3, wherein the dialogue association features include a dialogue order and a dialogue logical relationship, and the determining, from all the nodes, the associated node corresponding to each node according to the dialogue association features comprises:
sorting the nodes according to the dialogue order;
acquiring any two nodes at adjacent sorting positions, and taking the node at the latter sorting position as an associated node of the node at the former sorting position;
and acquiring any two nodes that are not at adjacent sorting positions and whose corresponding dialogue logical relationship indicates a response relationship, and taking the node corresponding to the responding party in the response relationship as an associated node of the node corresponding to the responded party.
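A minimal Python sketch of the associated-node determination in claims 3 and 4; the node indices follow the dialogue order, and both the pair representation of response relationships and the edge direction (from each node to its associated node) are illustrative assumptions:

    from typing import List, Set, Tuple

    def build_edges(num_nodes: int,
                    responses: Set[Tuple[int, int]]) -> List[Tuple[int, int]]:
        # Each directed edge (u, v) points from node u to its associated node v.
        edges = [(i, i + 1) for i in range(num_nodes - 1)]  # adjacent in dialogue order
        # Non-adjacent response relationships: (responded_party, responding_party).
        edges += [(a, b) for (a, b) in responses if abs(a - b) > 1]
        return edges

    # 5 dialogue sentences; sentence 4 responds to sentence 0.
    print(build_edges(5, {(0, 4)}))
    # [(0, 1), (1, 2), (2, 3), (3, 4), (0, 4)]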
5. The data processing method according to claim 3, wherein the connecting each node with the corresponding associated node to construct the target graph structure corresponding to the dialogue sample comprises:
connecting each node with the corresponding associated node through a directional connecting edge to obtain a graph structure corresponding to the dialogue sample, wherein the directional connecting edge comprises an arrow end and a non-arrow end, and for each node, the arrow end is connected to the corresponding associated node;
and determining the target graph structure according to the graph structure.
6. The data processing method according to claim 5, wherein the determining the target graph structure according to the graph structure comprises:
counting an out-degree value and an in-degree value of each node in the graph structure, wherein the out-degree value is the number of non-arrow ends connected with the corresponding node, and the in-degree value is the number of arrow ends connected with the corresponding node;
and updating the graph structure according to the out-degree value and the in-degree value to obtain the target graph structure.
7. The data processing method according to claim 6, wherein the updating the graph structure according to the out-degree value and the in-degree value comprises:
when a first difference value between the out-degree value and the in-degree value is equal to a first threshold value, deleting the corresponding node and the directional connecting edges connected with the corresponding node;
and when the in-degree value is smaller than a second threshold value and the out-degree value is greater than or equal to a third threshold value, selecting at least one directional connecting edge from the directional connecting edges connected with the corresponding node for deletion, wherein the first threshold value is smaller than the second threshold value, and the second threshold value is smaller than the third threshold value.
8. The data processing method according to claim 7, wherein the selecting at least one directional connecting edge from the directional connecting edges connected with the corresponding node for deletion comprises:
determining all other nodes connected with the corresponding node as candidate nodes;
calculating a second difference value between the in-degree value and the out-degree value of each candidate node;
and deleting the directional connecting edges between the corresponding node and the candidate nodes in ascending order of the second difference values, until the out-degree value of the corresponding node is equal to the third threshold value.
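A minimal Python sketch of the degree-based updating in claims 6 to 8, under stated assumptions: degrees are counted once on the input graph structure, candidate nodes are limited to the node's successors, and deletion proceeds in ascending order of the second difference values:

    from typing import Set, Tuple

    def prune_graph(edges: Set[Tuple[int, int]],
                    t1: int, t2: int, t3: int) -> Set[Tuple[int, int]]:
        # Edge (u, v) points from u to its associated node v, so u gains
        # out-degree (non-arrow end) and v gains in-degree (arrow end).
        nodes = {n for e in edges for n in e}
        out_deg = {n: sum(1 for u, _ in edges if u == n) for n in nodes}
        in_deg = {n: sum(1 for _, v in edges if v == n) for n in nodes}

        pruned = set(edges)
        for n in nodes:
            if out_deg[n] - in_deg[n] == t1:
                # First difference equals the first threshold: drop the node
                # together with every directional connecting edge it touches.
                pruned = {(u, v) for (u, v) in pruned if u != n and v != n}
            elif in_deg[n] < t2 and out_deg[n] >= t3:
                # Delete outgoing edges toward the candidates with the
                # smallest second difference (in-degree minus out-degree)
                # first, until the out-degree equals the third threshold.
                cands = sorted((v for (u, v) in pruned if u == n),
                               key=lambda v: in_deg[v] - out_deg[v])
                for v in cands:
                    if out_deg[n] <= t3:
                        break
                    pruned.discard((n, v))
                    out_deg[n] -= 1
        return pruned

    print(prune_graph({(0, 1), (1, 2), (2, 3), (3, 4), (0, 4)}, t1=2, t2=1, t3=2))
    # e.g. {(1, 2), (2, 3), (3, 4)} -- node 0 (out 2, in 0) meets t1 and is dropped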
9. The data processing method according to any one of claims 1 to 8, wherein the preset model includes a feature enhancement model and a classification model, and the data processing method further comprises, before the training the preset model by using the target graph structure:
obtaining a classification label corresponding to each dialogue sentence in the dialogue sample;
wherein the training the preset model by using the target graph structure comprises: training the feature enhancement model and the classification model by using the target graph structure and the classification labels.
10. The data processing method according to claim 9, wherein the training the feature enhancement model and the classification model by using the target graph structure and the classification labels comprises:
inputting the target graph structure into the feature enhancement model to obtain enhancement feature information;
inputting the enhancement feature information into the classification model to obtain a prediction result;
determining an error value between the prediction result and the classification label;
and reversely adjusting the feature enhancement model and the classification model according to the error value.
11. The data processing method according to claim 9, wherein the classification model includes an emotion classification model and an intention classification model, and the data processing method further comprises, after the training the preset model by using the target graph structure:
acquiring dialogue data to be identified;
and identifying the dialogue data based on the trained model to obtain target emotion information and target intention information.
12. A data processing apparatus, comprising:
an obtaining module, configured to obtain a dialogue sample, wherein the dialogue sample comprises a plurality of dialogue sentences;
a determining module, configured to determine, according to the dialogue sample, content features of each dialogue sentence and dialogue association features among the plurality of dialogue sentences;
a construction module, configured to construct a target graph structure corresponding to the dialogue sample according to the content features and the dialogue association features;
and a training module, configured to train a preset model by using the target graph structure to obtain a trained model.
13. A computer-readable storage medium, wherein a plurality of instructions are stored in the computer-readable storage medium, and the instructions are adapted to be loaded by a processor to perform the data processing method according to any one of claims 1 to 11.
14. An electronic device, comprising a processor and a memory coupled to the processor, wherein the memory stores a computer program, and the processor is configured to execute the computer program in the memory to perform the steps of the data processing method according to any one of claims 1 to 11.
CN202211426856.6A 2022-11-15 2022-11-15 Data processing method and device, storage medium and electronic equipment Pending CN118051652A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211426856.6A CN118051652A (en) 2022-11-15 2022-11-15 Data processing method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN118051652A true CN118051652A (en) 2024-05-17

Family

ID=91045512

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211426856.6A Pending CN118051652A (en) 2022-11-15 2022-11-15 Data processing method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN118051652A (en)

Legal Events

Date Code Title Description
PB01 Publication