CN116434027A - Artificial intelligent interaction system based on image recognition - Google Patents

Artificial intelligent interaction system based on image recognition Download PDF

Info

Publication number
CN116434027A
CN116434027A (application number CN202310686364.9A)
Authority
CN
China
Prior art keywords
interaction
feature
features
module
voice
Prior art date
Legal status
Pending
Application number
CN202310686364.9A
Other languages
Chinese (zh)
Inventor
全一明
张雪莹
Current Assignee
Shenzhen Xingxun Technology Co ltd
Original Assignee
Shenzhen Xingxun Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Xingxun Technology Co ltd
Priority to CN202310686364.9A
Publication of CN116434027A
Current legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 - Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/48 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011 - Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017 - Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V40/28 - Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/70 - Multimodal biometrics, e.g. combining information from different biometric modalities
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to the technical field of image recognition, in particular to an artificial intelligent interaction system based on image recognition. The system comprises a database unit, an image recognition unit, a feature fusion unit and an intelligent interaction unit. A feature database covering multiple interaction modes is established in the database unit; the image recognition unit collects user images and recognizes multiple kinds of feature data, and the corresponding interaction contents are then output from the database unit according to these data. The feature fusion unit fuses the multiple interaction contents and outputs the one with the highest proportion, which is executed by the intelligent interaction unit. This avoids the limitation of executing interaction operations through a single interaction mode, in which the interaction mode cannot be changed at will; because the interaction operation to execute is identified from multiple interaction contents, its determination is more accurate and accuracy is improved.

Description

Artificial intelligent interaction system based on image recognition
Technical Field
The invention relates to the technical field of image recognition, in particular to an artificial intelligent interaction system based on image recognition.
Background
With the rapid development of computers, mobile devices, the Internet of Things, cloud computing and related technologies, artificial intelligence has become one of the most popular technologies today, and human-computer interaction is its most representative branch. The application potential of human-computer interaction technology is already apparent, for example in the geospatial tracking built into smartphones, in the motion recognition, stealth technology and immersive games used with wearable computers, in the touch interaction applied to virtual reality, remote-controlled robots and telemedicine, and in the speech recognition used for call routing, home automation and voice dialing. However, existing artificial intelligence interaction systems have limitations such as low recognition accuracy and a single interaction mode. In particular, when interaction is performed through speech recognition, loud ambient noise prevents the user's speech from being recognized accurately, so the recognized interaction content is inaccurate; and with only a single interaction mode, the operation may have to be repeated many times while the interaction remains inaccurate and the recognition accuracy poor. We therefore propose an artificial intelligent interaction system based on image recognition.
Disclosure of Invention
The invention aims to provide an artificial intelligent interaction system based on image recognition, which aims to solve the problems in the background technology.
In order to achieve the above purpose, the invention provides an artificial intelligent interaction system based on image recognition, which comprises a database unit, an image recognition unit, a feature fusion unit and an intelligent interaction unit;
the database unit is used for establishing a feature database corresponding to a plurality of interaction modes, wherein the plurality of interaction modes comprise voice interaction, lip language interaction and gesture interaction; the image recognition unit is used for collecting user images and recognizing various feature data in the input images through a deep learning algorithm, wherein the various feature data comprise voice features, lip language features and gesture features; the feature fusion unit is used for inputting the feature data recognized by the image recognition unit into the database unit, outputting the interaction contents corresponding to the various feature data, and fusing the plurality of interaction contents to generate the final interaction content; and the intelligent interaction unit is used for receiving the interaction content finally determined by the feature fusion unit and executing the interaction operation.
As a further improvement of the technical scheme, the database is expressed as:

A = {(a_i, b_i, c_i, d_i) | i = 1, 2, ..., n}

wherein A represents the set forming the feature database, a_i represents the interactive content, b_i represents the voice feature, c_i represents the lip language feature, d_i represents the gesture feature, and n is the number of features.
As a further improvement of the technical scheme, the image recognition unit comprises an image acquisition module, a voice feature recognition module, a lip language feature recognition module and a gesture feature recognition module;
the image acquisition module is used for acquiring image data and audio data corresponding to a user through the camera, and the voice characteristic recognition module is used for recognizing the characteristics of voice content according to the audio data acquired by the image acquisition module; the lip language feature recognition module is used for recognizing lip language features corresponding to lips of a user according to the image data collected by the image collection module, and the gesture feature recognition module is used for recognizing gesture features according to the image data collected by the image collection module.
As a further improvement of the technical scheme, the voice feature recognition module, the lip language feature recognition module and the gesture feature recognition module all adopt a convolutional neural network of a deep learning algorithm for model training, and the method comprises the following steps of:
preprocessing: converting the audio data and the image data into digital signals and preprocessing the digital signals;
feature extraction: extracting features of the preprocessed audio data and the preprocessed image data;
model training: model training is carried out on the extracted features by using a convolutional neural network;
identification and output: and inputting the acquired audio data and image data into a model, so as to realize conversion of voice signals into texts, and conversion of image signals into lip language features and gesture features.
As a further improvement of the technical scheme, the lip language feature recognition module further comprises a lip contour recognition module when features are extracted, and the lip contour recognition module is used for determining the form and the dynamic features of lips in image data by adopting an edge detection algorithm.
As a further improvement of the technical scheme, the feature fusion unit comprises an interactive content determination module, a fusion analysis module and a priority definition module;
the interactive content determining module is used for transmitting the data of the voice features, the lip language features and the gesture features to the database unit and sequentially storing the interaction contents output by the database unit for the voice features, the lip language features and the gesture features; the fusion analysis module is used for fusing the interaction contents corresponding to the voice features, the lip language features and the gesture features and comparing whether the multiple interaction contents coincide; the priority definition module is used for outputting the interaction content with the highest proportion according to the juxtaposition of the interaction contents, and outputting the interaction content according to the priority sequence if a plurality of interaction contents are tied.
As a further improvement of the technical scheme, the fusion analysis module adopts a parallel comparison algorithm to judge the parallel condition of three interactive contents, and comprises the following steps:
setting three texts as t1, t2 and t3, and respectively corresponding to the voice feature, the lip language feature and the gesture feature;
the similarity of the interactive contents is judged by calculating the pairwise edit distances among t1, t2 and t3, a similarity matrix is obtained, and the juxtaposition of the three interactive contents is judged; the expression is:

S(i, j) = 1 - d(t_i, t_j) / max(L(t_i), L(t_j))

wherein S(i, j) represents the similarity of text t_i and text t_j, d(t_i, t_j) is the edit distance between the two texts, L(t_i) represents the length of text t_i, and L(t_j) represents the length of text t_j.
As a further improvement of the present technical solution, the priority sequence includes:
first-level, lip language features;
second level, gesture features;
third level, voice feature;
when a plurality of interactive contents are juxtaposed, the interactive contents are sequentially output from the first stage, the second stage and the third stage.
As a further improvement of the technical scheme, the image recognition unit further comprises an emotion analysis module, wherein the emotion analysis module is used for analyzing the current emotion of the user according to the voice characteristics, transmitting emotion signals to the intelligent interaction unit and executing matched interaction operation.
As a further improvement of the technical scheme, the emotion analysis module comprises the following steps when analyzing the emotion of the current user: relevant characteristic parameters of voice characteristics are extracted, including fundamental frequency and formant frequency of sound, and voice emotion is classified and identified through clustering, classification and classifier training of the characteristic parameters.
Compared with the prior art, the invention has the beneficial effects that:
according to the image recognition-based artificial intelligent interaction system, the feature databases of various interaction modes are established in the database unit, after the image recognition unit collects user images and recognizes various feature data, corresponding interaction contents can be output from the database unit according to various data, the various interaction contents are fused according to the feature fusion unit, the interaction contents with high occupation ratio are output, the intelligent interaction unit is used for executing the interaction, the limitation of executing the interaction operation caused by a single interaction mode is avoided, the interaction mode cannot be changed at will, the interaction operation is executed by the interaction content with high grade is recognized from the plurality of interaction contents, the interaction operation of the interaction execution is determined to be more accurate, and the accuracy is improved.
Drawings
FIG. 1 is a schematic block diagram of the overall concept of the present invention;
FIG. 2 is a schematic block diagram of an image recognition unit of the present invention;
fig. 3 is a schematic block diagram of a feature fusion unit of the present invention.
The meaning of each reference sign in the figure is:
100. a database unit;
200. an image recognition unit; 210. an image acquisition module; 220. a voice feature recognition module; 230. a lip language feature recognition module; 240. a gesture feature recognition module;
300. a feature fusion unit; 310. an interactive content determining module; 320. a fusion analysis module; 330. a priority definition module;
400. and an intelligent interaction unit.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
With the rapid development of computers, mobile devices, the Internet of Things, cloud computing and related technologies, artificial intelligence has become one of the most popular technologies today, and human-computer interaction is its most representative branch. The application potential of human-computer interaction technology is already apparent, for example in the geospatial tracking built into smartphones, in the motion recognition, stealth technology and immersive games used with wearable computers, in the touch interaction applied to virtual reality, remote-controlled robots and telemedicine, and in the speech recognition used for call routing, home automation and voice dialing;
referring to fig. 1-3, a first embodiment of the present invention is shown, and the present embodiment provides an image recognition-based artificial intelligent interaction system, which includes a database unit 100, an image recognition unit 200, a feature fusion unit 300, and an intelligent interaction unit 400;
the database unit 100 is configured to establish a feature database corresponding to a plurality of interaction modes, where the plurality of interaction modes includes voice interaction, lip language interaction and gesture interaction;
the expression of the database is:

A = {(a_i, b_i, c_i, d_i) | i = 1, 2, ..., n}

wherein A represents the set forming the feature database, a_i represents the interactive content, b_i represents the voice feature, c_i represents the lip language feature, d_i represents the gesture feature, and n is the number of features. For example, the interactive content corresponding to a1 is "hello", b1 is the voice-packet feature of the user saying "hello", c1 is the lip feature corresponding to the user uttering the "hello" voice packet, and d1 is the user's "hello" gesture; a1, b1, c1 and d1 together express the correspondence among the four element points, so that when any one element point is input, the remaining element points can conveniently be output.
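As a minimal illustration of this correspondence, the sketch below renders such a feature database as a small lookup structure in Python; the record fields, identifiers and the lookup helper are hypothetical names introduced only for this example, not part of the patent.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class FeatureRecord:
    content: str   # a_i: interactive content, e.g. "hello"
    voice: str     # b_i: identifier of the stored voice-packet feature
    lip: str       # c_i: identifier of the stored lip-movement feature
    gesture: str   # d_i: identifier of the stored gesture feature

# Toy feature database A = {(a_i, b_i, c_i, d_i) | i = 1..n}
DATABASE: List[FeatureRecord] = [
    FeatureRecord("hello", "voice_hello", "lip_hello", "gesture_hello"),
    FeatureRecord("bye",   "voice_bye",   "lip_bye",   "gesture_bye"),
]

def lookup(feature_kind: str, feature_id: str) -> Optional[str]:
    """Given one recognized element point, return the corresponding interactive content."""
    for record in DATABASE:
        if getattr(record, feature_kind) == feature_id:
            return record.content
    return None

print(lookup("lip", "lip_hello"))  # -> "hello"
```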
The image recognition unit 200 is used for collecting a user image, and recognizing various feature data in an input image through a deep learning algorithm, wherein the various feature data comprise voice features, lip language features and gesture features;
the image recognition unit 200 includes an image acquisition module 210, a voice feature recognition module 220, a lip language feature recognition module 230, and a gesture feature recognition module 240;
the image acquisition module 210 is used for acquiring image data and audio data corresponding to a user through a camera, and the voice feature recognition module 220 is used for recognizing features of the voice content from the audio data acquired by the image acquisition module 210; the lip language feature recognition module 230 is used for recognizing the lip language features corresponding to the user's lips from the image data collected by the image acquisition module 210, and the gesture feature recognition module 240 is used for recognizing gesture features from the image data collected by the image acquisition module 210.
It should be noted that, the speech feature recognition module 220, the lip feature recognition module 230 and the gesture feature recognition module 240 all use a convolutional neural network of a deep learning algorithm for model training, and include the following steps:
preprocessing: converting the audio data and the image data into digital signals and preprocessing them, for example removing noise from and filtering the audio data, and adjusting the brightness of, sharpening and normalizing the image data, to facilitate subsequent processing and analysis;
feature extraction: for the preprocessed audio data and image data, techniques such as short-time energy, frequency and spectrograms can be used to extract features of the audio data; these features describe the frequency and energy of the speech signal and the speaker's voice and tone, so that the content corresponding to the voice features can be captured better;
model training: model training is carried out on the extracted features using a convolutional neural network, which can be realized by training on hundreds of samples and tuning the model with methods such as cross validation;
identification and output: after model training is completed, the collected audio data and image data can be input into the model, so that voice signals are converted into text and image signals are converted into lip language features and gesture features, i.e. the corresponding contents are recognized and the converted text is output. In other words, for the voice collected through the camera, the content corresponding to the voice features is recognized: the words actually spoken by the user are recognized from the collected audio data and converted into text (a code sketch of this pipeline is given below).
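The sketch below is a minimal illustration of the preprocessing, feature extraction, model training and recognition steps described above, using PyTorch as one possible framework. The network shape, class count, input size and the random placeholder batch are all assumptions for illustration; the patent does not fix a particular architecture.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Toy convolutional classifier standing in for the voice / lip / gesture recognizers."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)

    def forward(self, x):                    # x: (batch, 1, 64, 64) preprocessed feature maps
        h = self.features(x)
        return self.classifier(h.flatten(1))

# Model training on pre-extracted features (e.g. spectrograms for audio, cropped frames for lips/gestures).
model = SmallCNN(num_classes=10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

features = torch.randn(8, 1, 64, 64)         # placeholder batch of preprocessed inputs
labels = torch.randint(0, 10, (8,))          # placeholder class labels (text / lip / gesture classes)
for _ in range(5):                           # a few illustrative training iterations
    optimizer.zero_grad()
    loss = loss_fn(model(features), labels)
    loss.backward()
    optimizer.step()

# Identification and output: the class with the highest score is the recognized content.
predicted = model(features).argmax(dim=1)
```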
The lip language feature recognition module 230 further comprises a lip contour recognition module used during feature extraction; the lip contour recognition module is used for determining the form and the dynamic features of the lips in the image data by adopting an edge detection algorithm. Specifically, the edge detection algorithm adopts the Canny operator, a widely used edge detection algorithm characterized by high accuracy and the ability to detect very thin edges. The specific process is as follows:
first, Gaussian filtering is applied to the image to smooth it and remove Gaussian noise; the gradient of the image is then calculated to find the intensity change at each pixel; non-maximum suppression is applied to the gradient values, keeping only the pixels with the maximum local gradient change and suppressing non-edge pixels; edge and non-edge pixels are then separated by setting a high threshold and a low threshold, and the lip contour is finally determined.
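A minimal OpenCV sketch of this Canny-based lip contour step is given below; the input file name and the threshold values are assumptions, and in the module the input would be a lip region cropped from the camera frame.

```python
import cv2

# Hypothetical lip-region crop taken from a camera frame.
lip_region = cv2.imread("lip_crop.png", cv2.IMREAD_GRAYSCALE)

# 1. Gaussian filtering to smooth the image and suppress Gaussian noise.
blurred = cv2.GaussianBlur(lip_region, (5, 5), sigmaX=1.4)

# 2-4. Gradient computation, non-maximum suppression and double (high/low) thresholding
#      are all performed inside cv2.Canny; the thresholds below are illustrative.
edges = cv2.Canny(blurred, threshold1=50, threshold2=150)

# The resulting edge map outlines the lip contour; tracking it across frames
# gives the dynamic lip features used by the lip-language recognition module.
contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
```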
The feature fusion unit 300 is configured to input the feature data identified by the image identification unit 200 into the database unit 100, output interactive contents corresponding to the plurality of feature data, and fuse the plurality of interactive contents to generate final interactive contents, and the intelligent interactive unit 400 is configured to receive the final determination of the interactive contents by the feature fusion unit 300 to perform interactive operations.
The feature fusion unit 300 includes an interactive content determination module 310, a fusion analysis module 320, and a priority definition module 330;
the interactive content determining module 310 is configured to transmit data of the voice feature, the lip language feature and the gesture feature to the database unit 100, and sequentially store interactive contents corresponding to the voice feature, the lip language feature and the gesture feature output by the database unit 100; the fusion analysis module 320 is configured to fuse the interactive contents corresponding to the voice feature, the lip language feature and the gesture feature, and compare the parallel situations of the multiple interactive contents; the priority definition module 330 is configured to output the interactive content with a high ratio according to the parallel situation of the interactive content, and if a plurality of interactive contents are parallel, output the interactive content according to the priority sequence.
For example, if the interaction content corresponding to the voice feature is "hello", the content corresponding to the lip language feature is "hello", and the content corresponding to the gesture feature is "bye", comparing them gives the juxtaposition: "hello" accounts for 2/3 and "bye" for 1/3, so the priority definition module 330 outputs the content with the highest ranking, i.e. the highest proportion, which is "hello". If, however, the content corresponding to the voice feature is "hello", the content corresponding to the lip language feature is "handshake", and the content corresponding to the gesture feature is "bye", then "hello", "handshake" and "bye" each account for 1/3; the output is a set of tied interaction contents, namely "hello", "handshake" and "bye", and the priority definition module 330 selects the corresponding interaction content according to the preset priority sequence.
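As a small sketch of this fusion behaviour, the code below implements a majority vote with the priority tie-break (lip language first, then gesture, then voice, following the priority sequence defined later); the function name and the dictionary layout are illustrative assumptions.

```python
from collections import Counter

# Priority used only to break ties: lip language first, then gesture, then voice.
PRIORITY = ["lip", "gesture", "voice"]

def fuse(contents: dict) -> str:
    """contents maps modality -> recognized interactive content,
    e.g. {"voice": "hello", "lip": "hello", "gesture": "bye"}."""
    counts = Counter(contents.values())
    best = max(counts.values())
    tied = [c for c, n in counts.items() if n == best]
    if len(tied) == 1:                 # one content has the highest proportion
        return tied[0]
    for modality in PRIORITY:          # otherwise fall back to the priority sequence
        if contents[modality] in tied:
            return contents[modality]
    return tied[0]

print(fuse({"voice": "hello", "lip": "hello", "gesture": "bye"}))      # -> "hello" (2/3)
print(fuse({"voice": "hello", "lip": "handshake", "gesture": "bye"}))  # -> "handshake" (lip priority)
```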
The fusion analysis module 320 adopts a parallel comparison algorithm to judge the parallel condition of three interactive contents, and comprises the following steps:
setting three texts as t1, t2 and t3, and respectively corresponding to the voice feature, the lip language feature and the gesture feature;
the similarity of the interactive contents is judged by calculating the pairwise edit distances among t1, t2 and t3, a similarity matrix is obtained, and the juxtaposition of the three interactive contents is judged; the expression is:

S(i, j) = 1 - d(t_i, t_j) / max(L(t_i), L(t_j))

wherein S(i, j) represents the similarity of text t_i and text t_j, d(t_i, t_j) is the edit distance between the two texts, L(t_i) represents the length of text t_i, and L(t_j) represents the length of text t_j. The smaller the edit distance, the higher the similarity; by comparing S(i, j) with a threshold, it can be determined whether t_i and t_j are the same. If so, the identical interaction contents are merged; otherwise they are regarded as independent interaction contents. For the three texts, a 3×3 similarity matrix is obtained, and the juxtaposition of the three interaction contents can be determined by the following rules (a code sketch follows the rules):
if all three texts are identical, they are identical, side-by-side;
if there are two texts that are identical, they are partially identical, in a side-by-side relationship;
if no text is the same, they are independent interactive content and no juxtaposition exists.
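The sketch below illustrates this comparison: it computes the Levenshtein edit distance, the normalized similarity S(i, j) and a simple juxtaposition decision. The similarity threshold of 0.8 is an assumption chosen for the example; the patent only requires comparison against "a threshold".

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming (single-row variant)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[len(b)]

def similarity(ti: str, tj: str) -> float:
    """S(i, j) = 1 - d(t_i, t_j) / max(L(t_i), L(t_j))."""
    if not ti and not tj:
        return 1.0
    return 1.0 - edit_distance(ti, tj) / max(len(ti), len(tj))

def juxtaposition(t1: str, t2: str, t3: str, threshold: float = 0.8):
    """Return the 3x3 similarity matrix and the pairs of texts judged to be the same."""
    texts = [t1, t2, t3]
    matrix = [[similarity(a, b) for b in texts] for a in texts]
    same = [(i, j) for i in range(3) for j in range(i + 1, 3) if matrix[i][j] >= threshold]
    return matrix, same

matrix, same_pairs = juxtaposition("hello", "hello", "bye")
print(same_pairs)   # [(0, 1)] -> t1 and t2 carry the same interactive content
```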
The priority sequence includes:
first level, lip language features: when three interaction contents are tied, the interaction content corresponding to the lip language features, defined first in the priority order, is used as the output. Because gesture features vary with changes in the external scene and voice features are affected by external noise, those features may be inaccurate, so taking the lip language features as the priority is the best choice;
second level, gesture features: when no lip language feature has been recognized, the gesture features are taken as the priority;
third level, voice features: finally, the interaction content corresponding to the voice features is taken as the priority;
when a plurality of interactive contents are juxtaposed, the interactive contents are sequentially output from the first stage, the second stage and the third stage.
In summary, existing artificial intelligence interaction systems have limitations such as low recognition accuracy and a single interaction mode; in particular, when interaction is performed through speech recognition, loud ambient noise prevents the user's speech from being recognized accurately and the recognized interaction content is inaccurate, and with only a single interaction mode the operation may be repeated many times while the interaction remains inaccurate and the recognition accuracy poor. Therefore, by establishing feature databases for multiple interaction modes in the database unit 100, having the image recognition unit 200 collect user images and recognize multiple kinds of feature data, outputting the corresponding interaction contents from the database unit 100 according to these data, fusing the multiple interaction contents in the feature fusion unit 300, outputting the content with the highest proportion and executing it through the intelligent interaction unit 400, the limitation of executing interaction operations through a single interaction mode, in which the interaction mode cannot be changed at will, is avoided; the highest-ranking interaction content is identified from the multiple interaction contents and executed, so the determination of the interaction operation to execute is more accurate and accuracy is improved.
Since artificial intelligence interaction cannot switch between different interaction modes according to the user's emotion, the interaction operation is monotonous and uninteresting. A second embodiment of the invention is therefore shown; the difference between this embodiment and the first embodiment is that the image recognition unit 200 further comprises an emotion analysis module, which is used for analyzing the user's current emotion according to the voice features and transmitting an emotion signal to the intelligent interaction unit 400 so that a matching interaction operation is executed. For example, if the emotion analysis module finds that the user's emotion is particularly agitated, the intelligent interaction unit 400 carries out the interaction operation in a gentle and amusing manner to put the user in a good mood; specifically, the intelligent interaction unit 400 can preset multiple interaction manners so that the corresponding manner of interaction operation is matched once the emotion is recognized.
The emotion analysis module, when analyzing the emotion of the current user, comprises the following steps: extracting relevant characteristic parameters of voice characteristics, including fundamental frequency and formant frequency of sound, classifying and identifying voice emotion through clustering, classification and classifier training of the characteristic parameters, wherein the characteristic parameters comprise:
fundamental frequency of sound: reflecting basic pitch characteristics of speech;
formant frequency: reflecting the tone and formant characteristics in the voice;
voice time-frequency characteristics: through time-frequency analysis, the short-time spectrum or Mel-frequency cepstral coefficients (MFCC) of the voice signal are extracted to reflect characteristics such as the timbre, phonemes and rhythm of the speech. By comprehensively considering these characteristic parameters, a classifier model can be used for emotion recognition and classification to obtain a classification of the user's emotion, such as pleased, depressed or angry; common classifier models include the support vector machine (SVM), the K-nearest neighbor algorithm (KNN) and decision tree algorithms.
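A minimal sketch of this emotion pipeline is given below, combining librosa for fundamental frequency, spectral and MFCC statistics with a scikit-learn SVM classifier; the file name, feature layout, label set and random training data are placeholders, and KNN or a decision tree could be substituted for the SVM.

```python
import numpy as np
import librosa
from sklearn.svm import SVC

def emotion_features(path: str) -> np.ndarray:
    """Fundamental frequency, spectral and MFCC statistics for one utterance."""
    y, sr = librosa.load(path, sr=16000)
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)             # fundamental frequency track
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)         # Mel-frequency cepstral coefficients
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)   # rough stand-in for formant behaviour
    return np.concatenate([
        [np.nanmean(f0), np.nanstd(f0)],
        mfcc.mean(axis=1), mfcc.std(axis=1),
        [centroid.mean()],
    ])

# Placeholder training set: 29-dimensional feature vectors with emotion labels.
X_train = np.random.randn(30, 29)
y_train = np.random.choice(["pleased", "depressed", "angry"], size=30)

clf = SVC(kernel="rbf")
clf.fit(X_train, y_train)
# emotion = clf.predict([emotion_features("utterance.wav")])[0]
```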
The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the above-described embodiments, and that the above-described embodiments and descriptions are only preferred embodiments of the present invention, and are not intended to limit the invention, and that various changes and modifications may be made therein without departing from the spirit and scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (10)

1. An artificial intelligence interaction system based on image recognition is characterized in that: comprises a database unit (100), an image recognition unit (200), a feature fusion unit (300) and an intelligent interaction unit (400);
the database unit (100) is used for establishing a feature database corresponding to a plurality of interaction modes, wherein the plurality of interaction modes comprise voice interaction, lip language interaction and gesture interaction; the image recognition unit (200) is used for collecting a user image and recognizing various feature data in the input image through a deep learning algorithm, wherein the various feature data comprise voice features, lip language features and gesture features; the feature fusion unit (300) is used for inputting the feature data recognized by the image recognition unit (200) into the database unit (100), outputting the interaction contents corresponding to the various feature data, and fusing the plurality of interaction contents to generate the final interaction content; and the intelligent interaction unit (400) is used for receiving the interaction content finally determined by the feature fusion unit (300) and executing the interaction operation.
2. The image recognition-based artificial intelligence interactive system according to claim 1, wherein: the expression of the database is:
A = {(a_i, b_i, c_i, d_i) | i = 1, 2, ..., n}

wherein A represents the set forming the feature database, a_i represents the interactive content, b_i represents the voice feature, c_i represents the lip language feature, d_i represents the gesture feature, and n is the number of features.
3. The image recognition-based artificial intelligence interactive system according to claim 1, wherein: the image recognition unit (200) comprises an image acquisition module (210), a voice feature recognition module (220), a lip language feature recognition module (230) and a gesture feature recognition module (240);
the image acquisition module (210) is used for acquiring image data and audio data corresponding to a user through a camera, and the voice characteristic recognition module (220) is used for recognizing characteristics of voice content according to the audio data acquired by the image acquisition module (210); the lip language feature recognition module (230) is used for recognizing lip language features corresponding to lips of a user according to image data collected by the image collection module (210), and the gesture feature recognition module (240) is used for recognizing gesture features according to the image data collected by the image collection module (210).
4. The image recognition-based artificial intelligence interactive system according to claim 3, wherein: the voice feature recognition module (220), the lip language feature recognition module (230) and the gesture feature recognition module (240) all adopt a convolutional neural network of a deep learning algorithm for model training, and the method comprises the following steps of:
preprocessing: converting the audio data and the image data into digital signals and preprocessing the digital signals;
feature extraction: extracting features of the preprocessed audio data and the preprocessed image data;
model training: model training is carried out on the extracted features by using a convolutional neural network;
identification and output: and inputting the acquired audio data and image data into a model, so as to realize conversion of voice signals into texts, and conversion of image signals into lip language features and gesture features.
5. The image recognition-based artificial intelligence interactive system according to claim 4, wherein: the lip language feature recognition module (230) further comprises a lip contour recognition module when the features are extracted, and the lip contour recognition module is used for determining the form and the dynamic features of the lips in the image data by adopting an edge detection algorithm.
6. The image recognition-based artificial intelligence interactive system according to claim 4, wherein: the feature fusion unit (300) comprises an interactive content determination module (310), a fusion analysis module (320) and a priority definition module (330);
the interactive content determining module (310) is used for transmitting data of the voice characteristics, the lip language characteristics and the gesture characteristics to the database unit (100), and sequentially storing interactive contents corresponding to the voice characteristics, the lip language characteristics and the gesture characteristics output by the database unit (100); the fusion analysis module (320) is used for fusing interactive contents corresponding to voice features, lip language features and gesture features and comparing a plurality of parallel interactive contents; the priority definition module (330) is configured to output the interactive content with a high ratio according to the parallel situation of the interactive content, and if a plurality of interactive contents are parallel, output the interactive content according to the priority sequence.
7. The image recognition-based artificial intelligence interactive system according to claim 6, wherein: the fusion analysis module (320) adopts a parallel comparison algorithm to judge the parallel condition of three interactive contents, and comprises the following steps:
setting three texts as t1, t2 and t3, and respectively corresponding to the voice feature, the lip language feature and the gesture feature;
calculating the average value of the editing distances of t1, t2 and t3, judging the similarity of the interactive contents to obtain a similarity matrix, and judging the parallel condition of the three interactive contents, wherein the expression is as follows:
S(i, j) = 1 - d(t_i, t_j) / max(L(t_i), L(t_j))

wherein S(i, j) represents the similarity of text t_i and text t_j, d(t_i, t_j) is the edit distance between the two texts, L(t_i) represents the length of text t_i, and L(t_j) represents the length of text t_j.
8. The image recognition-based artificial intelligence interactive system according to claim 7, wherein: the priority sequence includes:
first-level, lip language features;
second level, gesture features;
third level, voice feature;
when a plurality of interactive contents are juxtaposed, the interactive contents are sequentially output from the first stage, the second stage and the third stage.
9. The image recognition-based artificial intelligence interactive system according to claim 6, wherein: the image recognition unit (200) further comprises an emotion analysis module for analyzing the current emotion of the user according to the voice characteristics, transmitting an emotion signal to the intelligent interaction unit (400) to perform matched interaction operation.
10. The image recognition-based artificial intelligence interactive system according to claim 8, wherein: the emotion analysis module, when analyzing the emotion of the current user, comprises the following steps: relevant characteristic parameters of voice characteristics are extracted, including fundamental frequency and formant frequency of sound, and voice emotion is classified and identified through clustering, classification and classifier training of the characteristic parameters.
CN202310686364.9A 2023-06-12 2023-06-12 Artificial intelligent interaction system based on image recognition Pending CN116434027A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310686364.9A CN116434027A (en) 2023-06-12 2023-06-12 Artificial intelligent interaction system based on image recognition


Publications (1)

Publication Number Publication Date
CN116434027A true CN116434027A (en) 2023-07-14

Family

ID=87091051

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310686364.9A Pending CN116434027A (en) 2023-06-12 2023-06-12 Artificial intelligent interaction system based on image recognition

Country Status (1)

Country Link
CN (1) CN116434027A (en)


Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130201105A1 (en) * 2012-02-02 2013-08-08 Raymond William Ptucha Method for controlling interactive display system
US20130300650A1 (en) * 2012-05-09 2013-11-14 Hung-Ta LIU Control system with input method using recognitioin of facial expressions
CN102932212A (en) * 2012-10-12 2013-02-13 华南理工大学 Intelligent household control system based on multichannel interaction manner
WO2016150001A1 (en) * 2015-03-24 2016-09-29 中兴通讯股份有限公司 Speech recognition method, device and computer storage medium
CN107239139A (en) * 2017-05-18 2017-10-10 刘国华 Based on the man-machine interaction method and system faced
CN107256392A (en) * 2017-06-05 2017-10-17 南京邮电大学 A kind of comprehensive Emotion identification method of joint image, voice
CN108052079A (en) * 2017-12-12 2018-05-18 北京小米移动软件有限公司 Apparatus control method, device, plant control unit and storage medium
CN111737670A (en) * 2019-03-25 2020-10-02 广州汽车集团股份有限公司 Multi-mode data collaborative man-machine interaction method and system and vehicle-mounted multimedia device
CN111079791A (en) * 2019-11-18 2020-04-28 京东数字科技控股有限公司 Face recognition method, face recognition device and computer-readable storage medium
WO2021196802A1 (en) * 2020-03-31 2021-10-07 科大讯飞股份有限公司 Method, apparatus, and device for training multimode voice recognition model, and storage medium
WO2022110564A1 (en) * 2020-11-25 2022-06-02 苏州科技大学 Smart home multi-modal human-machine natural interaction system and method thereof
CN115424614A (en) * 2022-08-31 2022-12-02 长城汽车股份有限公司 Human-computer interaction method and device, electronic equipment and vehicle
CN115620407A (en) * 2022-10-28 2023-01-17 浙江吉利控股集团有限公司 Information exchange method and device and vehicle
CN115793852A (en) * 2022-11-15 2023-03-14 长城汽车股份有限公司 Method for acquiring operation indication based on cabin area, display method and related equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
杨钊 et al.: "Research on the Application of Combined Similarity Algorithms and Knowledge Graphs in the Coordination of Power Grid Digitization Projects", 《电力信息与通信技术》 (Electric Power Information and Communication Technology), vol. 21, no. 3, pages 41-46 *

Similar Documents

Publication Publication Date Title
CN109817213B (en) Method, device and equipment for performing voice recognition on self-adaptive language
CN110021308B (en) Speech emotion recognition method and device, computer equipment and storage medium
CN110838289B (en) Wake-up word detection method, device, equipment and medium based on artificial intelligence
CN110853618B (en) Language identification method, model training method, device and equipment
CN108962255B (en) Emotion recognition method, emotion recognition device, server and storage medium for voice conversation
CN105976809B (en) Identification method and system based on speech and facial expression bimodal emotion fusion
CN110853617B (en) Model training method, language identification method, device and equipment
CN110990543A (en) Intelligent conversation generation method and device, computer equipment and computer storage medium
CN107369439B (en) Voice awakening method and device
CN108346427A (en) A kind of audio recognition method, device, equipment and storage medium
KR20210052036A (en) Apparatus with convolutional neural network for obtaining multiple intent and method therof
Alshamsi et al. Automated facial expression and speech emotion recognition app development on smart phones using cloud computing
CN111161726B (en) Intelligent voice interaction method, device, medium and system
US20230068798A1 (en) Active speaker detection using image data
CN110910898B (en) Voice information processing method and device
CN110827799A (en) Method, apparatus, device and medium for processing voice signal
CN110728993A (en) Voice change identification method and electronic equipment
CN108847251A (en) A kind of voice De-weight method, device, server and storage medium
CN116645683A (en) Signature handwriting identification method, system and storage medium based on prompt learning
CN113128284A (en) Multi-mode emotion recognition method and device
CN107180629B (en) Voice acquisition and recognition method and system
CN111048068A (en) Voice wake-up method, device and system and electronic equipment
CN116434027A (en) Artificial intelligent interaction system based on image recognition
CN113111855B (en) Multi-mode emotion recognition method and device, electronic equipment and storage medium
CN115062131A (en) Multi-mode-based man-machine interaction method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination