CN116434027A - Artificial intelligent interaction system based on image recognition - Google Patents

Artificial intelligent interaction system based on image recognition Download PDF

Info

Publication number
CN116434027A
CN116434027A (application number CN202310686364.9A)
Authority
CN
China
Prior art keywords
interaction
feature
features
module
voice
Prior art date
Legal status
Pending
Application number
CN202310686364.9A
Other languages
Chinese (zh)
Inventor
全一明
张雪莹
Current Assignee
Shenzhen Xingxun Technology Co ltd
Original Assignee
Shenzhen Xingxun Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Xingxun Technology Co ltd
Priority to CN202310686364.9A
Publication of CN116434027A
Current legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 - Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/48 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011 - Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017 - Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V40/28 - Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/70 - Multimodal biometrics, e.g. combining information from different biometric modalities
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to the technical field of image recognition, in particular to an artificial intelligent interaction system based on image recognition. The system comprises a database unit, an image recognition unit, a feature fusion unit and an intelligent interaction unit. A feature database covering multiple interaction modes is established in the database unit; the image recognition unit collects user images and recognizes multiple kinds of feature data, and the corresponding interaction contents are then output from the database unit according to these data. The feature fusion unit fuses the multiple interaction contents and outputs the one with the highest proportion, which is executed by the intelligent interaction unit. This avoids the limitation of executing interaction operations through a single interaction mode, in which the interaction mode cannot be changed at will; because the interaction operation to execute is identified from multiple interaction contents, its determination is more accurate and accuracy is improved.

Description

Artificial intelligent interaction system based on image recognition
Technical Field
The invention relates to the technical field of image recognition, in particular to an artificial intelligent interaction system based on image recognition.
Background
With the rapid development of computers, mobile devices, the Internet of Things, cloud computing and related technologies, artificial intelligence has become one of the most popular technologies today, and human-computer interaction is its most representative branch. The application potential of human-computer interaction technology is already apparent, for example in the geospatial tracking built into smartphones, in the motion recognition, stealth technology and immersive games used with wearable computers, in the touch interaction applied to virtual reality, remote-controlled robots and telemedicine, and in the speech recognition used for call routing, home automation and voice dialing. However, existing artificial intelligence interaction systems have limitations such as low recognition accuracy and a single interaction mode. In particular, when interaction is performed through speech recognition, loud ambient noise prevents the user's speech from being recognized accurately, so the recognized interaction content is inaccurate; and with only a single interaction mode, the operation may have to be repeated many times while the interaction remains inaccurate and the recognition accuracy poor. We therefore propose an artificial intelligent interaction system based on image recognition.
Disclosure of Invention
The invention aims to provide an artificial intelligent interaction system based on image recognition, which aims to solve the problems in the background technology.
In order to achieve the above purpose, the invention provides an artificial intelligent interaction system based on image recognition, which comprises a database unit, an image recognition unit, a feature fusion unit and an intelligent interaction unit;
the database unit is used for establishing a feature database corresponding to a plurality of interaction modes, wherein the plurality of interaction modes comprise voice interaction, lip language interaction and gesture interaction; the image recognition unit is used for collecting user images and recognizing various feature data in the input images through a deep learning algorithm, wherein the various feature data comprise voice features, lip language features and gesture features; the feature fusion unit is used for inputting the feature data recognized by the image recognition unit into the database unit, outputting the interaction contents corresponding to the various feature data, and fusing the plurality of interaction contents to generate the final interaction content; and the intelligent interaction unit is used for receiving the interaction content finally determined by the feature fusion unit and executing the interaction operation.
As a further improvement of the technical scheme, the database is expressed as:

A = {(a_i, b_i, c_i, d_i) | i = 1, 2, ..., n}

wherein A represents the set forming the feature database, a_i represents the interactive content, b_i represents the voice feature, c_i represents the lip language feature, d_i represents the gesture feature, and n is the number of features.
As a further improvement of the technical scheme, the image recognition unit comprises an image acquisition module, a voice feature recognition module, a lip language feature recognition module and a gesture feature recognition module;
the image acquisition module is used for acquiring image data and audio data corresponding to a user through the camera, and the voice characteristic recognition module is used for recognizing the characteristics of voice content according to the audio data acquired by the image acquisition module; the lip language feature recognition module is used for recognizing lip language features corresponding to lips of a user according to the image data collected by the image collection module, and the gesture feature recognition module is used for recognizing gesture features according to the image data collected by the image collection module.
As a further improvement of the technical scheme, the voice feature recognition module, the lip language feature recognition module and the gesture feature recognition module all adopt a convolutional neural network of a deep learning algorithm for model training, and the method comprises the following steps of:
preprocessing: converting the audio data and the image data into digital signals and preprocessing the digital signals;
feature extraction: extracting features of the preprocessed audio data and the preprocessed image data;
model training: model training is carried out on the extracted features by using a convolutional neural network;
identification and output: and inputting the acquired audio data and image data into a model, so as to realize conversion of voice signals into texts, and conversion of image signals into lip language features and gesture features.
As a further improvement of the technical scheme, the lip language feature recognition module further comprises a lip contour recognition module when features are extracted, and the lip contour recognition module is used for determining the form and the dynamic features of lips in image data by adopting an edge detection algorithm.
As a further improvement of the technical scheme, the feature fusion unit comprises an interactive content determination module, a fusion analysis module and a priority definition module;
the interactive content determining module is used for transmitting the data of the voice features, the lip language features and the gesture features to the database unit and sequentially storing the interaction contents output by the database unit for the voice features, the lip language features and the gesture features; the fusion analysis module is used for fusing the interaction contents corresponding to the voice features, the lip language features and the gesture features and comparing whether the multiple interaction contents coincide; the priority definition module is used for outputting the interaction content with the highest proportion according to the juxtaposition of the interaction contents, and outputting the interaction content according to the priority sequence if a plurality of interaction contents are tied.
As a further improvement of the technical scheme, the fusion analysis module adopts a parallel comparison algorithm to judge the parallel condition of three interactive contents, and comprises the following steps:
setting three texts as t1, t2 and t3, and respectively corresponding to the voice feature, the lip language feature and the gesture feature;
the similarity of the interactive contents is judged by calculating the pairwise edit distances among t1, t2 and t3, a similarity matrix is obtained, and the juxtaposition of the three interactive contents is judged; the expression is:

S(i, j) = 1 - d(t_i, t_j) / max(L(t_i), L(t_j))

wherein S(i, j) represents the similarity of text t_i and text t_j, d(t_i, t_j) is the edit distance between the two texts, L(t_i) represents the length of text t_i, and L(t_j) represents the length of text t_j.
As a further improvement of the present technical solution, the priority sequence includes:
first-level, lip language features;
second level, gesture features;
third level, voice feature;
when a plurality of interactive contents are juxtaposed, the interactive contents are sequentially output from the first stage, the second stage and the third stage.
As a further improvement of the technical scheme, the image recognition unit further comprises an emotion analysis module, wherein the emotion analysis module is used for analyzing the current emotion of the user according to the voice characteristics, transmitting emotion signals to the intelligent interaction unit and executing matched interaction operation.
As a further improvement of the technical scheme, the emotion analysis module comprises the following steps when analyzing the emotion of the current user: relevant characteristic parameters of voice characteristics are extracted, including fundamental frequency and formant frequency of sound, and voice emotion is classified and identified through clustering, classification and classifier training of the characteristic parameters.
Compared with the prior art, the invention has the beneficial effects that:
according to the image recognition-based artificial intelligent interaction system, the feature databases of various interaction modes are established in the database unit, after the image recognition unit collects user images and recognizes various feature data, corresponding interaction contents can be output from the database unit according to various data, the various interaction contents are fused according to the feature fusion unit, the interaction contents with high occupation ratio are output, the intelligent interaction unit is used for executing the interaction, the limitation of executing the interaction operation caused by a single interaction mode is avoided, the interaction mode cannot be changed at will, the interaction operation is executed by the interaction content with high grade is recognized from the plurality of interaction contents, the interaction operation of the interaction execution is determined to be more accurate, and the accuracy is improved.
Drawings
FIG. 1 is a schematic block diagram of the overall concept of the present invention;
FIG. 2 is a schematic block diagram of an image recognition unit of the present invention;
fig. 3 is a schematic block diagram of a feature fusion unit of the present invention.
The meaning of each reference sign in the figure is:
100. a database unit;
200. an image recognition unit; 210. an image acquisition module; 220. a voice feature recognition module; 230. a lip language feature recognition module; 240. a gesture feature recognition module;
300. a feature fusion unit; 310. an interactive content determining module; 320. a fusion analysis module; 330. a priority definition module;
400. and an intelligent interaction unit.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
With the rapid development of computers, mobile devices, the Internet of Things, cloud computing and related technologies, artificial intelligence has become one of the most popular technologies today, and human-computer interaction is its most representative branch. The application potential of human-computer interaction technology is already apparent, for example in the geospatial tracking built into smartphones, in the motion recognition, stealth technology and immersive games used with wearable computers, in the touch interaction applied to virtual reality, remote-controlled robots and telemedicine, and in the speech recognition used for call routing, home automation and voice dialing;
referring to fig. 1-3, a first embodiment of the present invention is shown, and the present embodiment provides an image recognition-based artificial intelligent interaction system, which includes a database unit 100, an image recognition unit 200, a feature fusion unit 300, and an intelligent interaction unit 400;
the database unit 100 is configured to establish a feature database corresponding to a plurality of interaction modes, where the plurality of interaction modes includes voice interaction, lip language interaction and gesture interaction;
the expression of the database is:

A = {(a_i, b_i, c_i, d_i) | i = 1, 2, ..., n}

wherein A represents the set forming the feature database, a_i represents the interactive content, b_i represents the voice feature, c_i represents the lip language feature, d_i represents the gesture feature, and n is the number of features. For example, the interactive content corresponding to a1 is "hello", b1 is the voice-packet feature of the user saying "hello", c1 is the lip feature corresponding to the user uttering the "hello" voice packet, and d1 is the user's "hello" gesture; a1, b1, c1 and d1 together express the correspondence among the four element points, so that when any one element point is input, the remaining element points can conveniently be output.
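As a minimal illustration of this correspondence, the sketch below renders such a feature database as a small lookup structure in Python; the record fields, identifiers and the lookup helper are hypothetical names introduced only for this example, not part of the patent.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class FeatureRecord:
    content: str   # a_i: interactive content, e.g. "hello"
    voice: str     # b_i: identifier of the stored voice-packet feature
    lip: str       # c_i: identifier of the stored lip-movement feature
    gesture: str   # d_i: identifier of the stored gesture feature

# Toy feature database A = {(a_i, b_i, c_i, d_i) | i = 1..n}
DATABASE: List[FeatureRecord] = [
    FeatureRecord("hello", "voice_hello", "lip_hello", "gesture_hello"),
    FeatureRecord("bye",   "voice_bye",   "lip_bye",   "gesture_bye"),
]

def lookup(feature_kind: str, feature_id: str) -> Optional[str]:
    """Given one recognized element point, return the corresponding interactive content."""
    for record in DATABASE:
        if getattr(record, feature_kind) == feature_id:
            return record.content
    return None

print(lookup("lip", "lip_hello"))  # -> "hello"
```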
The image recognition unit 200 is used for collecting a user image, and recognizing various feature data in an input image through a deep learning algorithm, wherein the various feature data comprise voice features, lip language features and gesture features;
the image recognition unit 200 includes an image acquisition module 210, a voice feature recognition module 220, a lip language feature recognition module 230, and a gesture feature recognition module 240;
the image acquisition module 210 is used for acquiring image data and audio data corresponding to a user through a camera, and the voice feature recognition module 220 is used for recognizing features of the voice content from the audio data acquired by the image acquisition module 210; the lip language feature recognition module 230 is used for recognizing the lip language features corresponding to the user's lips from the image data collected by the image acquisition module 210, and the gesture feature recognition module 240 is used for recognizing gesture features from the image data collected by the image acquisition module 210.
It should be noted that, the speech feature recognition module 220, the lip feature recognition module 230 and the gesture feature recognition module 240 all use a convolutional neural network of a deep learning algorithm for model training, and include the following steps:
preprocessing: converting the audio data and the image data into digital signals and preprocessing them, for example removing noise from and filtering the audio data, and adjusting the brightness of, sharpening and normalizing the image data, to facilitate subsequent processing and analysis;
feature extraction: for the preprocessed audio data and image data, techniques such as short-time energy, frequency and spectrograms can be used to extract features of the audio data; these features describe the frequency and energy of the speech signal and the speaker's voice and tone, so that the content corresponding to the voice features can be captured better;
model training: model training is carried out on the extracted features using a convolutional neural network, which can be realized by training on hundreds of samples and tuning the model with methods such as cross validation;
identification and output: after model training is completed, the collected audio data and image data can be input into the model, so that voice signals are converted into text and image signals are converted into lip language features and gesture features, i.e. the corresponding contents are recognized and the converted text is output. In other words, for the voice collected through the camera, the content corresponding to the voice features is recognized: the words actually spoken by the user are recognized from the collected audio data and converted into text (a code sketch of this pipeline is given below).
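The sketch below is a minimal illustration of the preprocessing, feature extraction, model training and recognition steps described above, using PyTorch as one possible framework. The network shape, class count, input size and the random placeholder batch are all assumptions for illustration; the patent does not fix a particular architecture.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Toy convolutional classifier standing in for the voice / lip / gesture recognizers."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)

    def forward(self, x):                    # x: (batch, 1, 64, 64) preprocessed feature maps
        h = self.features(x)
        return self.classifier(h.flatten(1))

# Model training on pre-extracted features (e.g. spectrograms for audio, cropped frames for lips/gestures).
model = SmallCNN(num_classes=10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

features = torch.randn(8, 1, 64, 64)         # placeholder batch of preprocessed inputs
labels = torch.randint(0, 10, (8,))          # placeholder class labels (text / lip / gesture classes)
for _ in range(5):                           # a few illustrative training iterations
    optimizer.zero_grad()
    loss = loss_fn(model(features), labels)
    loss.backward()
    optimizer.step()

# Identification and output: the class with the highest score is the recognized content.
predicted = model(features).argmax(dim=1)
```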
The lip language feature recognition module 230 further comprises a lip contour recognition module used during feature extraction; the lip contour recognition module is used for determining the form and the dynamic features of the lips in the image data by adopting an edge detection algorithm. Specifically, the edge detection algorithm adopts the Canny operator, a widely used edge detection algorithm characterized by high accuracy and the ability to detect very thin edges. The specific process is as follows:
first, Gaussian filtering is applied to the image to smooth it and remove Gaussian noise; the gradient of the image is then calculated to find the intensity change at each pixel; non-maximum suppression is applied to the gradient values, keeping only the pixels with the maximum local gradient change and suppressing non-edge pixels; edge and non-edge pixels are then separated by setting a high threshold and a low threshold, and the lip contour is finally determined.
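A minimal OpenCV sketch of this Canny-based lip contour step is given below; the input file name and the threshold values are assumptions, and in the module the input would be a lip region cropped from the camera frame.

```python
import cv2

# Hypothetical lip-region crop taken from a camera frame.
lip_region = cv2.imread("lip_crop.png", cv2.IMREAD_GRAYSCALE)

# 1. Gaussian filtering to smooth the image and suppress Gaussian noise.
blurred = cv2.GaussianBlur(lip_region, (5, 5), sigmaX=1.4)

# 2-4. Gradient computation, non-maximum suppression and double (high/low) thresholding
#      are all performed inside cv2.Canny; the thresholds below are illustrative.
edges = cv2.Canny(blurred, threshold1=50, threshold2=150)

# The resulting edge map outlines the lip contour; tracking it across frames
# gives the dynamic lip features used by the lip-language recognition module.
contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
```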
The feature fusion unit 300 is configured to input the feature data identified by the image identification unit 200 into the database unit 100, output interactive contents corresponding to the plurality of feature data, and fuse the plurality of interactive contents to generate final interactive contents, and the intelligent interactive unit 400 is configured to receive the final determination of the interactive contents by the feature fusion unit 300 to perform interactive operations.
The feature fusion unit 300 includes an interactive content determination module 310, a fusion analysis module 320, and a priority definition module 330;
the interactive content determining module 310 is configured to transmit data of the voice feature, the lip language feature and the gesture feature to the database unit 100, and sequentially store interactive contents corresponding to the voice feature, the lip language feature and the gesture feature output by the database unit 100; the fusion analysis module 320 is configured to fuse the interactive contents corresponding to the voice feature, the lip language feature and the gesture feature, and compare the parallel situations of the multiple interactive contents; the priority definition module 330 is configured to output the interactive content with a high ratio according to the parallel situation of the interactive content, and if a plurality of interactive contents are parallel, output the interactive content according to the priority sequence.
For example, if the interaction content corresponding to the voice feature is "hello", the content corresponding to the lip language feature is "hello", and the content corresponding to the gesture feature is "bye", comparing them gives the juxtaposition: "hello" accounts for 2/3 and "bye" for 1/3, so the priority definition module 330 outputs the content with the highest ranking, i.e. the highest proportion, which is "hello". If, however, the content corresponding to the voice feature is "hello", the content corresponding to the lip language feature is "handshake", and the content corresponding to the gesture feature is "bye", then "hello", "handshake" and "bye" each account for 1/3; the output is a set of tied interaction contents, namely "hello", "handshake" and "bye", and the priority definition module 330 selects the corresponding interaction content according to the preset priority sequence.
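As a small sketch of this fusion behaviour, the code below implements a majority vote with the priority tie-break (lip language first, then gesture, then voice, following the priority sequence defined later); the function name and the dictionary layout are illustrative assumptions.

```python
from collections import Counter

# Priority used only to break ties: lip language first, then gesture, then voice.
PRIORITY = ["lip", "gesture", "voice"]

def fuse(contents: dict) -> str:
    """contents maps modality -> recognized interactive content,
    e.g. {"voice": "hello", "lip": "hello", "gesture": "bye"}."""
    counts = Counter(contents.values())
    best = max(counts.values())
    tied = [c for c, n in counts.items() if n == best]
    if len(tied) == 1:                 # one content has the highest proportion
        return tied[0]
    for modality in PRIORITY:          # otherwise fall back to the priority sequence
        if contents[modality] in tied:
            return contents[modality]
    return tied[0]

print(fuse({"voice": "hello", "lip": "hello", "gesture": "bye"}))      # -> "hello" (2/3)
print(fuse({"voice": "hello", "lip": "handshake", "gesture": "bye"}))  # -> "handshake" (lip priority)
```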
The fusion analysis module 320 adopts a parallel comparison algorithm to judge the parallel condition of three interactive contents, and comprises the following steps:
setting three texts as t1, t2 and t3, and respectively corresponding to the voice feature, the lip language feature and the gesture feature;
the similarity of the interactive contents is judged by calculating the pairwise edit distances among t1, t2 and t3, a similarity matrix is obtained, and the juxtaposition of the three interactive contents is judged; the expression is:

S(i, j) = 1 - d(t_i, t_j) / max(L(t_i), L(t_j))

wherein S(i, j) represents the similarity of text t_i and text t_j, d(t_i, t_j) is the edit distance between the two texts, L(t_i) represents the length of text t_i, and L(t_j) represents the length of text t_j. The smaller the edit distance, the higher the similarity; by comparing S(i, j) with a threshold, it can be determined whether t_i and t_j are the same. If so, the identical interaction contents are merged; otherwise they are regarded as independent interaction contents. For the three texts, a 3×3 similarity matrix is obtained, and the juxtaposition of the three interaction contents can be determined by the following rules (a code sketch follows the rules):
if all three texts are identical, they are identical, side-by-side;
if there are two texts that are identical, they are partially identical, in a side-by-side relationship;
if no text is the same, they are independent interactive content and no juxtaposition exists.
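The sketch below illustrates this comparison: it computes the Levenshtein edit distance, the normalized similarity S(i, j) and a simple juxtaposition decision. The similarity threshold of 0.8 is an assumption chosen for the example; the patent only requires comparison against "a threshold".

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming (single-row variant)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[len(b)]

def similarity(ti: str, tj: str) -> float:
    """S(i, j) = 1 - d(t_i, t_j) / max(L(t_i), L(t_j))."""
    if not ti and not tj:
        return 1.0
    return 1.0 - edit_distance(ti, tj) / max(len(ti), len(tj))

def juxtaposition(t1: str, t2: str, t3: str, threshold: float = 0.8):
    """Return the 3x3 similarity matrix and the pairs of texts judged to be the same."""
    texts = [t1, t2, t3]
    matrix = [[similarity(a, b) for b in texts] for a in texts]
    same = [(i, j) for i in range(3) for j in range(i + 1, 3) if matrix[i][j] >= threshold]
    return matrix, same

matrix, same_pairs = juxtaposition("hello", "hello", "bye")
print(same_pairs)   # [(0, 1)] -> t1 and t2 carry the same interactive content
```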
The priority sequence includes:
first level, lip language features: when three interaction contents are tied, the interaction content corresponding to the lip language features, defined first in the priority order, is used as the output. Because gesture features vary with changes in the external scene and voice features are affected by external noise, those features may be inaccurate, so taking the lip language features as the priority is the best choice;
second level, gesture features: when no lip language feature has been recognized, the gesture features are taken as the priority;
third level, voice features: finally, the interaction content corresponding to the voice features is taken as the priority;
when a plurality of interactive contents are juxtaposed, the interactive contents are sequentially output from the first stage, the second stage and the third stage.
In summary, existing artificial intelligence interaction systems have limitations such as low recognition accuracy and a single interaction mode; in particular, when interaction is performed through speech recognition, loud ambient noise prevents the user's speech from being recognized accurately and the recognized interaction content is inaccurate, and with only a single interaction mode the operation may be repeated many times while the interaction remains inaccurate and the recognition accuracy poor. Therefore, by establishing feature databases for multiple interaction modes in the database unit 100, having the image recognition unit 200 collect user images and recognize multiple kinds of feature data, outputting the corresponding interaction contents from the database unit 100 according to these data, fusing the multiple interaction contents in the feature fusion unit 300, outputting the content with the highest proportion and executing it through the intelligent interaction unit 400, the limitation of executing interaction operations through a single interaction mode, in which the interaction mode cannot be changed at will, is avoided; the highest-ranking interaction content is identified from the multiple interaction contents and executed, so the determination of the interaction operation to execute is more accurate and accuracy is improved.
Since artificial intelligence interaction cannot switch between different interaction modes according to the user's emotion, the interaction operation is monotonous and uninteresting. A second embodiment of the invention is therefore shown; the difference between this embodiment and the first embodiment is that the image recognition unit 200 further comprises an emotion analysis module, which is used for analyzing the user's current emotion according to the voice features and transmitting an emotion signal to the intelligent interaction unit 400 so that a matching interaction operation is executed. For example, if the emotion analysis module finds that the user's emotion is particularly agitated, the intelligent interaction unit 400 carries out the interaction operation in a gentle and amusing manner to put the user in a good mood; specifically, the intelligent interaction unit 400 can preset multiple interaction manners so that the corresponding manner of interaction operation is matched once the emotion is recognized.
The emotion analysis module, when analyzing the emotion of the current user, comprises the following steps: extracting relevant characteristic parameters of voice characteristics, including fundamental frequency and formant frequency of sound, classifying and identifying voice emotion through clustering, classification and classifier training of the characteristic parameters, wherein the characteristic parameters comprise:
fundamental frequency of sound: reflecting basic pitch characteristics of speech;
formant frequency: reflecting the tone and formant characteristics in the voice;
voice time-frequency characteristics: through time-frequency analysis, the short-time spectrum or Mel-frequency cepstral coefficients (MFCC) of the voice signal are extracted to reflect characteristics such as the timbre, phonemes and rhythm of the speech. By comprehensively considering these characteristic parameters, a classifier model can be used for emotion recognition and classification to obtain a classification of the user's emotion, such as pleased, depressed or angry; common classifier models include the support vector machine (SVM), the K-nearest neighbor algorithm (KNN) and decision tree algorithms.
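A minimal sketch of this emotion pipeline is given below, combining librosa for fundamental frequency, spectral and MFCC statistics with a scikit-learn SVM classifier; the file name, feature layout, label set and random training data are placeholders, and KNN or a decision tree could be substituted for the SVM.

```python
import numpy as np
import librosa
from sklearn.svm import SVC

def emotion_features(path: str) -> np.ndarray:
    """Fundamental frequency, spectral and MFCC statistics for one utterance."""
    y, sr = librosa.load(path, sr=16000)
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)             # fundamental frequency track
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)         # Mel-frequency cepstral coefficients
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)   # rough stand-in for formant behaviour
    return np.concatenate([
        [np.nanmean(f0), np.nanstd(f0)],
        mfcc.mean(axis=1), mfcc.std(axis=1),
        [centroid.mean()],
    ])

# Placeholder training set: 29-dimensional feature vectors with emotion labels.
X_train = np.random.randn(30, 29)
y_train = np.random.choice(["pleased", "depressed", "angry"], size=30)

clf = SVC(kernel="rbf")
clf.fit(X_train, y_train)
# emotion = clf.predict([emotion_features("utterance.wav")])[0]
```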
The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the above-described embodiments, and that the above-described embodiments and descriptions are only preferred embodiments of the present invention, and are not intended to limit the invention, and that various changes and modifications may be made therein without departing from the spirit and scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (10)

1. An artificial intelligence interaction system based on image recognition is characterized in that: comprises a database unit (100), an image recognition unit (200), a feature fusion unit (300) and an intelligent interaction unit (400);
the database unit (100) is used for establishing a feature database corresponding to a plurality of interaction modes, wherein the plurality of interaction modes comprise voice interaction, lip language interaction and gesture interaction; the image recognition unit (200) is used for collecting a user image and recognizing various feature data in the input image through a deep learning algorithm, wherein the various feature data comprise voice features, lip language features and gesture features; the feature fusion unit (300) is used for inputting the feature data recognized by the image recognition unit (200) into the database unit (100), outputting the interaction contents corresponding to the various feature data, and fusing the plurality of interaction contents to generate the final interaction content; and the intelligent interaction unit (400) is used for receiving the interaction content finally determined by the feature fusion unit (300) and executing the interaction operation.
2. The image recognition-based artificial intelligence interactive system according to claim 1, wherein: the expression of the database is:
A = {(a_i, b_i, c_i, d_i) | i = 1, 2, ..., n}

wherein A represents the set forming the feature database, a_i represents the interactive content, b_i represents the voice feature, c_i represents the lip language feature, d_i represents the gesture feature, and n is the number of features.
3. The image recognition-based artificial intelligence interactive system according to claim 1, wherein: the image recognition unit (200) comprises an image acquisition module (210), a voice feature recognition module (220), a lip language feature recognition module (230) and a gesture feature recognition module (240);
the image acquisition module (210) is used for acquiring image data and audio data corresponding to a user through a camera, and the voice characteristic recognition module (220) is used for recognizing characteristics of voice content according to the audio data acquired by the image acquisition module (210); the lip language feature recognition module (230) is used for recognizing lip language features corresponding to lips of a user according to image data collected by the image collection module (210), and the gesture feature recognition module (240) is used for recognizing gesture features according to the image data collected by the image collection module (210).
4. The image recognition-based artificial intelligence interactive system according to claim 3, wherein: the voice feature recognition module (220), the lip language feature recognition module (230) and the gesture feature recognition module (240) all adopt a convolutional neural network of a deep learning algorithm for model training, and the method comprises the following steps of:
preprocessing: converting the audio data and the image data into digital signals and preprocessing the digital signals;
feature extraction: extracting features of the preprocessed audio data and the preprocessed image data;
model training: model training is carried out on the extracted features by using a convolutional neural network;
identification and output: and inputting the acquired audio data and image data into a model, so as to realize conversion of voice signals into texts, and conversion of image signals into lip language features and gesture features.
5. The image recognition-based artificial intelligence interactive system according to claim 4, wherein: the lip language feature recognition module (230) further comprises a lip contour recognition module when the features are extracted, and the lip contour recognition module is used for determining the form and the dynamic features of the lips in the image data by adopting an edge detection algorithm.
6. The image recognition-based artificial intelligence interactive system according to claim 4, wherein: the feature fusion unit (300) comprises an interactive content determination module (310), a fusion analysis module (320) and a priority definition module (330);
the interactive content determining module (310) is used for transmitting data of the voice characteristics, the lip language characteristics and the gesture characteristics to the database unit (100), and sequentially storing interactive contents corresponding to the voice characteristics, the lip language characteristics and the gesture characteristics output by the database unit (100); the fusion analysis module (320) is used for fusing interactive contents corresponding to voice features, lip language features and gesture features and comparing a plurality of parallel interactive contents; the priority definition module (330) is configured to output the interactive content with a high ratio according to the parallel situation of the interactive content, and if a plurality of interactive contents are parallel, output the interactive content according to the priority sequence.
7. The image recognition-based artificial intelligence interactive system according to claim 6, wherein: the fusion analysis module (320) adopts a parallel comparison algorithm to judge the parallel condition of three interactive contents, and comprises the following steps:
setting three texts as t1, t2 and t3, and respectively corresponding to the voice feature, the lip language feature and the gesture feature;
calculating the average value of the editing distances of t1, t2 and t3, judging the similarity of the interactive contents to obtain a similarity matrix, and judging the parallel condition of the three interactive contents, wherein the expression is as follows:
S(i, j) = 1 - d(t_i, t_j) / max(L(t_i), L(t_j))

wherein S(i, j) represents the similarity of text t_i and text t_j, d(t_i, t_j) is the edit distance between the two texts, L(t_i) represents the length of text t_i, and L(t_j) represents the length of text t_j.
8. The image recognition-based artificial intelligence interactive system according to claim 7, wherein: the priority sequence includes:
first-level, lip language features;
second level, gesture features;
third level, voice feature;
when a plurality of interactive contents are juxtaposed, the interactive contents are sequentially output from the first stage, the second stage and the third stage.
9. The image recognition-based artificial intelligence interactive system according to claim 6, wherein: the image recognition unit (200) further comprises an emotion analysis module for analyzing the current emotion of the user according to the voice characteristics, transmitting an emotion signal to the intelligent interaction unit (400) to perform matched interaction operation.
10. The image recognition-based artificial intelligence interactive system according to claim 8, wherein: the emotion analysis module, when analyzing the emotion of the current user, comprises the following steps: relevant characteristic parameters of voice characteristics are extracted, including fundamental frequency and formant frequency of sound, and voice emotion is classified and identified through clustering, classification and classifier training of the characteristic parameters.
CN202310686364.9A 2023-06-12 2023-06-12 Artificial intelligent interaction system based on image recognition Pending CN116434027A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310686364.9A CN116434027A (en) 2023-06-12 2023-06-12 Artificial intelligent interaction system based on image recognition


Publications (1)

Publication Number Publication Date
CN116434027A true CN116434027A (en) 2023-07-14

Family

ID=87091051

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310686364.9A Pending CN116434027A (en) 2023-06-12 2023-06-12 Artificial intelligent interaction system based on image recognition

Country Status (1)

Country Link
CN (1) CN116434027A (en)


Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130201105A1 (en) * 2012-02-02 2013-08-08 Raymond William Ptucha Method for controlling interactive display system
US20130300650A1 (en) * 2012-05-09 2013-11-14 Hung-Ta LIU Control system with input method using recognitioin of facial expressions
CN102932212A (en) * 2012-10-12 2013-02-13 华南理工大学 Intelligent household control system based on multichannel interaction manner
WO2016150001A1 (en) * 2015-03-24 2016-09-29 中兴通讯股份有限公司 Speech recognition method, device and computer storage medium
CN107239139A (en) * 2017-05-18 2017-10-10 刘国华 Based on the man-machine interaction method and system faced
CN107256392A (en) * 2017-06-05 2017-10-17 南京邮电大学 A kind of comprehensive Emotion identification method of joint image, voice
CN108052079A (en) * 2017-12-12 2018-05-18 北京小米移动软件有限公司 Apparatus control method, device, plant control unit and storage medium
CN111737670A (en) * 2019-03-25 2020-10-02 广州汽车集团股份有限公司 Multi-mode data collaborative man-machine interaction method and system and vehicle-mounted multimedia device
CN111079791A (en) * 2019-11-18 2020-04-28 京东数字科技控股有限公司 Face recognition method, face recognition device and computer-readable storage medium
WO2021196802A1 (en) * 2020-03-31 2021-10-07 科大讯飞股份有限公司 Method, apparatus, and device for training multimode voice recognition model, and storage medium
WO2022110564A1 (en) * 2020-11-25 2022-06-02 苏州科技大学 Smart home multi-modal human-machine natural interaction system and method thereof
CN115424614A (en) * 2022-08-31 2022-12-02 长城汽车股份有限公司 Human-computer interaction method and device, electronic equipment and vehicle
CN115620407A (en) * 2022-10-28 2023-01-17 浙江吉利控股集团有限公司 Information exchange method and device and vehicle
CN115793852A (en) * 2022-11-15 2023-03-14 长城汽车股份有限公司 Method for acquiring operation indication based on cabin area, display method and related equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
杨钊 et al.: "Research on the Application of Combined Similarity Algorithms and Knowledge Graphs in the Coordination of Power Grid Digitization Projects", 《电力信息与通信技术》 (Electric Power Information and Communication Technology), vol. 21, no. 3, pages 41-46 *

Similar Documents

Publication Publication Date Title
CN109817213B (en) Method, device and equipment for performing voice recognition on self-adaptive language
CN110021308B (en) Speech emotion recognition method and device, computer equipment and storage medium
CN110838289B (en) Wake-up word detection method, device, equipment and medium based on artificial intelligence
CN110853618B (en) Language identification method, model training method, device and equipment
CN108962255B (en) Emotion recognition method, emotion recognition device, server and storage medium for voice conversation
CN105976809B (en) Identification method and system based on speech and facial expression bimodal emotion fusion
CN110853617B (en) Model training method, language identification method, device and equipment
CN110990543A (en) Intelligent conversation generation method and device, computer equipment and computer storage medium
CN107369439B (en) Voice awakening method and device
CN108346427A (en) A kind of audio recognition method, device, equipment and storage medium
KR20210052036A (en) Apparatus with convolutional neural network for obtaining multiple intent and method therof
Alshamsi et al. Automated facial expression and speech emotion recognition app development on smart phones using cloud computing
CN111161726B (en) Intelligent voice interaction method, device, medium and system
US20230068798A1 (en) Active speaker detection using image data
CN110910898B (en) Voice information processing method and device
CN110827799A (en) Method, apparatus, device and medium for processing voice signal
CN110728993A (en) Voice change identification method and electronic equipment
CN108847251A (en) A kind of voice De-weight method, device, server and storage medium
CN116645683A (en) Signature handwriting identification method, system and storage medium based on prompt learning
CN113128284A (en) Multi-mode emotion recognition method and device
CN107180629B (en) Voice acquisition and recognition method and system
CN111048068A (en) Voice wake-up method, device and system and electronic equipment
CN116434027A (en) Artificial intelligent interaction system based on image recognition
CN113111855B (en) Multi-mode emotion recognition method and device, electronic equipment and storage medium
CN115062131A (en) Multi-mode-based man-machine interaction method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination