CN117033556A - Memory preservation and memory extraction method based on artificial intelligence and related equipment - Google Patents

Memory preservation and memory extraction method based on artificial intelligence and related equipment

Info

Publication number
CN117033556A
CN117033556A CN202311051700.9A CN202311051700A CN117033556A CN 117033556 A CN117033556 A CN 117033556A CN 202311051700 A CN202311051700 A CN 202311051700A CN 117033556 A CN117033556 A CN 117033556A
Authority
CN
China
Prior art keywords
memory
audio
scene
data
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311051700.9A
Other languages
Chinese (zh)
Inventor
凌瑞端
宋少鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sugr Electronics Corp
Original Assignee
Sugr Electronics Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sugr Electronics Corp filed Critical Sugr Electronics Corp
Priority to CN202311051700.9A priority Critical patent/CN117033556A/en
Publication of CN117033556A publication Critical patent/CN117033556A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/55Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/65Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to the technical field of artificial intelligence, and in particular to an artificial-intelligence-based memory preservation and memory extraction method and related equipment. The method comprises: collecting video data and audio data of the scene where a user is located; processing the video data and the audio data to obtain image information and audio information respectively; classifying and recognizing the image information and the audio information to obtain a memory original text; calling a large language model to process the memory original text; and storing the resulting memory abstract together with the memory original text in a database. When a query question from the user about a past event is received, the database is queried and the memory abstract corresponding to the query question is output. The application can fill gaps in the user's memory, reduce the likelihood of memory omissions and memory errors, lighten the user's mental burden, and improve working efficiency and quality of life. In addition, a question-and-answer result is output based on the query question input by the user, forming a closed loop of memory preservation and memory extraction, which is convenient for the user and improves the efficiency of the user's work and daily life.

Description

Memory preservation and memory extraction method based on artificial intelligence and related equipment
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a memory preservation and memory extraction method and related equipment based on artificial intelligence.
Background
The emergence and spread of the Internet deliver a large amount of information to users, so that people must process ever more information, which leads to omissions in what they can remember. With the continuous development of artificial intelligence, many tools have appeared to help people record information, such as paper notebooks, note-taking software on computers, and meeting-minutes software for video conferences. However, all of these tools must be actively opened by the user before they can be used, and they are inconvenient to carry and operate.
Disclosure of Invention
In view of the above, the application provides an artificial-intelligence-based memory preservation and memory extraction method and related equipment, which fill gaps in the user's memory, reduce the likelihood of memory omissions and memory errors, and solve the problems of inconvenient carrying and use that exist in the prior art.
A first aspect of the present application provides an artificial intelligence based memory preservation and memory extraction method, the method comprising:
collecting video data and audio data of a scene where a user is located;
processing the video data to obtain image information, and processing the audio data to obtain audio information;
classifying and identifying the image information and the audio information to obtain a memory original text;
calling a large language model to process the memory original text, and storing the processed memory abstract and the memory original text in a database;
and when a query question from the user is received, querying the database and outputting the memory abstract corresponding to the query question.
In an optional embodiment, the processing the video data to obtain image information includes:
the video data are subjected to dynamic framing acquisition by combining a scene transformation detection algorithm and a rate prediction algorithm, so that a plurality of image data are obtained;
performing content segmentation on each image data to obtain image data blocks;
and carrying out image recognition on the image data block to obtain the image information.
In an optional embodiment, the combining the scene change detection algorithm and the rate prediction algorithm to dynamically frame-collect the video data, and obtaining the plurality of image data includes:
Performing scene detection on the video data by using a scene change detection algorithm to obtain a video scene type;
performing adaptive transformation rate prediction on video data corresponding to each video scene type by using a rate prediction algorithm;
when the predicted transformation rate is higher than a preset rate threshold, performing frame rate acquisition on the video data by adopting a first preset frame rate to obtain a plurality of image data corresponding to the video scene type;
when the predicted transformation rate is lower than the preset rate threshold, performing frame rate acquisition on the video data by adopting a second preset frame rate to obtain a plurality of image data corresponding to the video scene type;
wherein the first preset frame rate is greater than the second preset frame rate.
In an optional embodiment, the processing the audio data to obtain audio information includes:
the audio data are acquired in frames to obtain a plurality of sub-audio data;
detecting whether the acquisition scene of the audio data is transformed according to a scene transformation detection algorithm;
when the acquisition scene of the audio data is transformed, performing scene classification on the sub-audio data of which the acquisition scene is transformed to obtain an audio scene type;
Carrying out audio layering on each sub-audio data to obtain layered audio;
and carrying out audio recognition on the layered audio to obtain the audio information.
In an alternative embodiment, said classifying and identifying the image information and the audio information to obtain the memory original text includes:
and carrying out classification and identification on the image information to obtain an image text, carrying out classification and identification on the audio information to obtain an audio text, and carrying out semantic association on the image text and the audio text to obtain the memory original text.
In an alternative embodiment, said semantically associating said image text and said audio text comprises:
the image text and the audio text are semantically associated based on scene or time or place or subject to structurally merge the image text and the audio text.
In an alternative embodiment, the method further comprises:
classifying, compressing and storing the corresponding image information according to the video scene type; and
storing the audio scene type and the corresponding audio information.
In an alternative embodiment, when the query question is a voice query question input by the user in a voice form, the querying and outputting the memory abstract corresponding to the query question in the database includes:
Performing voice recognition on the voice query problem to obtain a text query problem;
and inquiring and outputting a memory abstract corresponding to the text inquiry problem in the database.
A second aspect of the application provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the artificial intelligence based memory preservation and memory extraction method when executing the computer program.
A third aspect of the present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the artificial intelligence based memory preservation and memory extraction method.
In summary, the artificial-intelligence-based memory preservation and memory extraction method and related equipment provided by the embodiments of the present application collect video data and audio data of the scene where the user is located, process the video data and the audio data to obtain image information and audio information respectively, classify and recognize the image information and the audio information to obtain a memory original text, call a large language model to process the memory original text, and store the resulting memory abstract together with the memory original text in a database. This fills gaps in the user's memory, reduces the likelihood of memory omissions and memory errors, lightens the user's mental burden, and improves working efficiency and quality of life. In addition, a question-and-answer result is output based on the query question input by the user, forming a closed loop of memory preservation and memory extraction, which is convenient for the user and improves the efficiency of the user's work and daily life.
Drawings
FIG. 1 is a flow chart of an artificial intelligence based memory preservation and memory extraction method according to an embodiment of the present application;
FIG. 2 is a data flow diagram illustrating processing of video data according to an embodiment of the present application;
FIG. 3 is a flow chart illustrating processing of video data according to an embodiment of the present application;
FIG. 4 is a data flow diagram illustrating processing of audio data according to an embodiment of the present application;
FIG. 5 is a flow chart illustrating processing of audio data according to an embodiment of the present application;
FIG. 6 is a block diagram of an electronic device according to an embodiment of the application;
FIG. 7 is a block diagram of another electronic device according to an embodiment of the application.
Detailed Description
The terminology used in the following embodiments of the application is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in the specification of the present application, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in this disclosure is intended to encompass any and all possible combinations of one or more of the listed items.
The terms "first," "second," and the like, are used below for descriptive purposes only and are not to be construed as implying or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature, and in the description of embodiments of the application, unless otherwise indicated, the meaning of "a plurality" is two or more.
FIG. 1 is a flow chart illustrating an artificial-intelligence-based memory preservation and memory extraction method according to an embodiment of the present application. The method may be executed by an electronic device and specifically includes the following steps.
S11, collecting video data and audio data of a scene where a user is located.
Data of the scene where the user is located is collected in real time and saved as memory, thereby achieving memory preservation. Later, when the user has forgotten something or remembers it only vaguely, the saved data can be searched or retrieved, thereby achieving memory extraction. The data of the scene where the user is located includes video data and audio data of that scene.
The video data of the scene in which the user is located may include visual information in front of the user and all image information that the user browses on the display of the user side device (e.g., computer). The audio data of the scene in which the user is located may include audible information in front of the user and all sound information received by a microphone of the user side device (e.g., a computer). The collection of video data and audio data of a scene where a user is located corresponds to the collection of all visual information that the user can see and audible information that the user can hear.
In some embodiments, the video data of the scene in which the user is located may be acquired by an image acquisition device, which may be built into the electronic device, for example, a camera of the electronic device, or may be independent of the electronic device. When the image acquisition device is independent of the electronic device, the image acquisition device can transmit the acquired video data of the scene where the user is located to the electronic device in a wired or wireless mode.
In some embodiments, audio data of a scene in which a user is located may be collected using an audio collection device, which may be built into the electronic device, for example, a microphone of the electronic device, or may be independent of the electronic device. When the audio acquisition device is independent of the electronic device, the audio acquisition device can transmit the acquired audio data of the scene where the user is located to the electronic device in a wired or wireless mode.
In some embodiments, the electronic device may be a bracelet-type, glasses-type, necklace-type, helmet-type or handheld-type device, and any device that can implement the memory preservation and memory extraction method based on artificial intelligence provided by the embodiment of the application may be included in the application, regardless of the structural form.
S12, processing the video data to obtain image information, and processing the audio data to obtain audio information.
When processing the video data, referring to fig. 2 and 3 together, the method for processing the video data to obtain image information specifically includes the following steps:
S21, combining a scene change detection algorithm and a rate prediction algorithm to dynamically acquire the video data in frames to obtain a plurality of image data.
Since the video data is not intended for viewing or communication, the framing frequency may be set to one or more frames per second; that is, the video data is read cyclically frame by frame, at one frame per second or several frames per second, to achieve dynamic framed acquisition of the video data.
In an optional embodiment, the combining the scene change detection algorithm and the rate prediction algorithm to dynamically frame-collect the video data, and obtaining the plurality of image data includes:
Performing scene detection on the video data by using a scene change detection algorithm to obtain a video scene type;
performing adaptive transformation rate prediction on video data corresponding to each video scene type by using a rate prediction algorithm;
when the predicted transformation rate is higher than a preset rate threshold, performing frame rate acquisition on the video data by adopting a first preset frame rate to obtain a plurality of image data corresponding to the video scene type;
when the predicted transformation rate is lower than the preset rate threshold, performing frame rate acquisition on the video data by adopting a second preset frame rate to obtain a plurality of image data corresponding to the video scene type;
wherein the first preset frame rate is greater than the second preset frame rate.
In the embodiment of the application, an initial acquisition frame rate, for example 15 frames per second, can be set according to the intended application scene and system requirements. Video data is first acquired frame by frame at the initial acquisition frame rate, and the timestamp and frame-rate information of each acquired frame of image data are recorded.
Meanwhile, the acquired image data is analyzed, and scene change detection is used to determine whether the acquisition scene of the video data has changed, for example through object movement, illumination changes, or the appearance/disappearance of objects. Methods such as background differencing, inter-frame differencing, optical flow, background modeling, and feature-point matching can be used. For example, the electronic device may compute a weighted average of the color information of the image data and calculate the inter-frame difference between the preceding and following image data from that weighted average. The color information may include, but is not limited to, luminance, chrominance, saturation, and texture information. The weighted average may be obtained by converting the image data from the RGB color space to the HSV (hue, saturation, value) color space, extracting the value of the brightness channel for each pixel, assigning a weight to each pixel's brightness value, multiplying each brightness value by its corresponding weight, summing the weighted values, and dividing by the sum of the weights; the inter-frame difference between two frames of image data is then obtained from these weighted averages. When the inter-frame difference is larger than a preset difference threshold, there is a large difference between the preceding and following image data, and it is determined that the acquisition scene of the video data has changed; when the inter-frame difference is smaller than the preset difference threshold, the preceding and following image data are nearly identical or differ only slightly, and it is determined that the acquisition scene has not changed.
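By way of illustration only, the following sketch (assuming OpenCV and NumPy are available) shows one possible realization of this inter-frame difference check; the concrete threshold value and the uniform pixel weights are assumptions of the sketch, not values fixed by the application.

```python
import cv2
import numpy as np

DIFF_THRESHOLD = 12.0  # assumed value for the "preset difference threshold"

def weighted_brightness(frame_bgr, weights=None):
    """Weighted average of the brightness (V) channel in HSV space."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    v = hsv[:, :, 2].astype(np.float32)
    if weights is None:
        # uniform weights; a real system might weight some pixels more heavily
        weights = np.ones_like(v)
    return float(np.sum(v * weights) / np.sum(weights))

def scene_changed(prev_frame, curr_frame, threshold=DIFF_THRESHOLD):
    """Declare a scene change when the inter-frame brightness difference exceeds the threshold."""
    diff = abs(weighted_brightness(curr_frame) - weighted_brightness(prev_frame))
    return diff > threshold
```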
The logic for predicting the change rate and adjusting the acquisition frame rate is then triggered for the current acquisition scene, and it is triggered again whenever the acquisition scene changes. The rate of change of the scene is predicted using machine learning, statistical analysis, or other algorithms: data such as the frame rate, change frequency, and change amplitude over a period of time are analyzed and modeled to estimate the future change rate. The change rate can be predicted using regression analysis, time-series analysis, or model-based prediction over the historical data.
Based on the predicted change rate, it is decided whether the acquisition frame rate should be increased or decreased. If the predicted change rate is high, the video data is changing frequently and a fast frame rate is needed to provide a smoother picture. If the predicted change rate is low, the video data is changing slowly; the frame rate can be reduced, saving storage space and processing resources.
For example, adaptive change-rate prediction is performed in the current scene. Video data with a fast change rate (i.e., a predicted change rate above the preset rate threshold) is acquired at a fast frame rate (the first preset frame rate): when a vehicle is driving on a road, the view through the windshield may be acquired at 120 frames per second. Video data with a slow change rate (i.e., a predicted change rate below the preset rate threshold) is acquired at a slow frame rate (the second preset frame rate): when a book is being read and a page is turned roughly once a minute, a frame rate of one frame per minute may be used. When a change of acquisition scene is detected in the video data, the data is marked as a new scene type, adaptive change-rate prediction is performed for that new scene type, and fast-changing scenes are again acquired at the fast frame rate while slow-changing scenes are acquired at the slow frame rate. These steps are repeated until the user turns off the image acquisition device, yielding a plurality of video scene types, each corresponding to a plurality of image data.
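A minimal sketch of this adaptive frame-rate selection follows; the rate threshold, the change-rate estimator, and the two preset frame rates are illustrative assumptions taken from the examples in this paragraph.

```python
# Assumed illustrative values; the application does not fix concrete numbers.
RATE_THRESHOLD = 0.5          # preset rate threshold (scene changes per second)
FAST_FRAME_RATE = 120         # first preset frame rate (frames per second)
SLOW_FRAME_RATE = 1 / 60      # second preset frame rate (one frame per minute)

def predict_change_rate(change_timestamps, window_s=60.0):
    """Estimate the future change rate from recent history (simple moving average)."""
    if len(change_timestamps) < 2:
        return 0.0
    latest = change_timestamps[-1]
    recent = [t for t in change_timestamps if t >= latest - window_s]
    return (len(recent) - 1) / window_s

def select_frame_rate(predicted_rate):
    """Pick the acquisition frame rate from the predicted change rate."""
    if predicted_rate > RATE_THRESHOLD:
        return FAST_FRAME_RATE   # fast-changing scene: capture the change details
    return SLOW_FRAME_RATE       # slow or static scene: save storage and compute
```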
In the embodiment of the application, while the video data is being acquired, the scene change detection algorithm detects whether the acquisition scene has changed, and a detected change triggers the change-rate prediction and frame-rate adjustment logic. By combining real-time video acquisition, scene change detection, and change-rate prediction, the acquisition frame rate is adjusted dynamically to suit different scenes. For rapidly changing dynamic scenes, acquisition at a fast frame rate accurately captures the details of the change, improving storage efficiency and the quality of subsequent video processing; for static or slowly changing scenes, acquisition at a slow frame rate reduces unnecessary image processing and computation, saving storage space and transmission bandwidth and thereby improving the efficiency of video data processing.
It should be noted that scene classification, which includes scene identification, is triggered only when a scene change occurs; if no scene change is detected, scene classification is not triggered. Scene classification of the image data yields the video scene type and can be flexibly designed and adjusted according to the specific task or user requirements.
In some embodiments, a scene classification model such as a support vector machine (Support Vector Machine, SVM), a random forest (Random Forest), or a deep neural network may be trained, and the electronic device may use the trained model to classify the image data in which a scene change was detected, obtaining a scene type. To distinguish it from the audio case below, the scene type obtained by classifying image data is referred to as a video scene type. Video scene types may include, but are not limited to: text, person, scenery, and so on.
Illustratively, when the image data contains text (text labels, billboards, etc.), the video scene type of the image data is the text type; when the image data contains a person (a person region can be identified using a face detection or object detection algorithm), the video scene type is the person type; and when the image data contains no obvious text or person, for example scenery, buildings, or natural environments, the video scene type is scenery.
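The following is a minimal sketch of such scene classification, assuming scikit-learn and OpenCV; the color-histogram features and the label names are assumptions of the sketch, and a real system would train the classifier on labeled scene data of its own.

```python
import cv2
import numpy as np
from sklearn.svm import SVC

def histogram_features(frame_bgr, bins=32):
    """Per-channel color histograms as a simple feature vector (illustrative choice)."""
    feats = []
    for ch in range(3):
        hist = cv2.calcHist([frame_bgr], [ch], None, [bins], [0, 256]).ravel()
        feats.append(hist / (hist.sum() + 1e-6))
    return np.concatenate(feats)

def train_scene_classifier(frames, labels):
    """Train an SVM on labeled frames; labels such as 'text', 'person', 'scenery'."""
    X = np.stack([histogram_features(f) for f in frames])
    clf = SVC(kernel="rbf")
    clf.fit(X, labels)
    return clf

def classify_scene(frame_bgr, clf):
    """Return the video scene type predicted for one frame of image data."""
    return clf.predict([histogram_features(frame_bgr)])[0]
```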
In some embodiments, after the plurality of image data is obtained, it may be preprocessed, for example by size normalization, denoising, or image enhancement, to improve the accuracy and efficiency both of detecting changes in the acquisition scene of the video data and of classifying the image data.
S22, content segmentation is carried out on each image data, and image data blocks are obtained.
The electronic device may perform content segmentation on each of the image data using a pre-stored content segmentation algorithm to extract one or more object blocks in each of the image data. The content segmentation algorithm may include, but is not limited to: semantic segmentation, instance segmentation, etc.
For example, suppose a piece of image data corresponds to a library scene. After content segmentation, three image data blocks may be obtained: one whose content is a book in the foreground, one whose content is a table in the middle ground, and one whose content is the library itself in the background.
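As a non-authoritative sketch of content segmentation, the snippet below (assuming PyTorch and torchvision) uses a pretrained DeepLabV3 network as a stand-in for the unspecified content segmentation algorithm; connected regions of one class in the returned class map would then be cut out as the image data blocks.

```python
import torch
from torchvision.models.segmentation import deeplabv3_resnet50
from torchvision import transforms
from PIL import Image

# Generic pretrained semantic-segmentation model (torchvision >= 0.13 weights API).
model = deeplabv3_resnet50(weights="DEFAULT").eval()
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def segment_image(path: str):
    """Return a per-pixel class map; regions of one class form the 'image data blocks'."""
    img = Image.open(path).convert("RGB")
    with torch.no_grad():
        out = model(preprocess(img).unsqueeze(0))["out"][0]
    return out.argmax(0).numpy()  # HxW array of class indices
```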
S23, carrying out image recognition on the image data block to obtain the image information.
The electronic equipment can input each image data block obtained through segmentation into a trained image recognition model for recognition to obtain corresponding image information. For example, the image information obtained by the character image data block recognition is "character", the image information obtained by the car image data block recognition is "car", and the image information obtained by the animal image data block recognition is "animal", etc.
It should be appreciated that different identification methods may be used for different types of image data blocks, e.g., face recognition techniques may be used for person image data blocks, vehicle model identification techniques may be used for car image data blocks, animal species classification techniques may be used for animal image data blocks, etc.
In an alternative embodiment, the electronic device stores the video scene type and the corresponding image information.
The electronic equipment can store the image information in a classified mode according to the video scene type. That is, image information having the same video scene type is stored in the same location, and image information having different video scene types is stored in different locations. For example, assuming that the first to third pieces of image data correspond to the same video scene type, for example, a person, the image information of the first to third pieces of image data and the corresponding video scene type (person) are stored. Assuming that the fourth image data corresponds to the same video scene type as the fifth image data, for example, a conversation class, the image information of the fourth image data and the fifth image data and the corresponding video scene type (conversation class) are stored.
In an alternative embodiment, the electronic device may further perform classification compression on the corresponding image information according to the video scene type. For example, assuming that the image information is text, such as books and newspapers, business cards, license plates, compression is performed by using a compression algorithm suitable for the text. Assuming that the image information is a scene, a general compression algorithm is adopted for compression.
Storing the video scene type together with the corresponding image information realizes structured storage of the image information.
The purpose of classifying and compressing the image information is to select the most suitable compression mode for image information of each scene type, thereby reducing the amount of storage required.
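A minimal sketch of scene-type-dependent compression follows, assuming Pillow; the mapping of text-type images to lossless PNG and of other images to lossy JPEG is an assumption of the sketch, since the application does not name concrete codecs.

```python
import os
from PIL import Image

def compress_and_store(image_path, scene_type, out_dir="memory_store"):
    """Pick a compression format per video scene type (illustrative mapping)."""
    os.makedirs(out_dir, exist_ok=True)
    name = os.path.splitext(os.path.basename(image_path))[0]
    img = Image.open(image_path)
    if scene_type == "text":
        # lossless compression keeps small glyphs (book pages, business cards, plates) readable
        img.save(os.path.join(out_dir, f"{scene_type}_{name}.png"), optimize=True)
    else:
        # a general lossy codec is acceptable for scenery and similar content
        img.convert("RGB").save(os.path.join(out_dir, f"{scene_type}_{name}.jpg"),
                                quality=70, optimize=True)
```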
When processing the audio data, referring to fig. 4 and fig. 5, the method for processing the audio data to obtain audio information specifically includes the following steps:
S41, carrying out frame acquisition on the audio data to obtain a plurality of sub-audio data.
The electronic device may preset an audio acquisition frame rate, and acquire audio data according to the preset audio acquisition frame rate. The frame-wise acquisition of audio data refers to dividing a continuous audio signal into a plurality of short-period audio data, each of which is sub-audio data. Audio data is typically present in the form of continuous analog signals, sampled and discretized into digital signals in an audio acquisition device.
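By way of illustration, the following sketch splits a mono digital signal into fixed-length sub-audio frames, assuming NumPy; the 500 ms frame length is an assumption, not a value fixed by the application.

```python
import numpy as np

def frame_audio(signal: np.ndarray, sample_rate: int, frame_ms: int = 500):
    """Split a mono audio signal into fixed-length sub-audio frames."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    return [signal[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]
```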
S42, detecting whether the acquisition scene of the audio data is transformed according to a scene transformation detection algorithm.
Scene change detection algorithms may include, but are not limited to: statistical-based methods, machine learning algorithms (e.g., support vector machines, decision trees, random forests, etc.), and deep learning algorithms (e.g., convolutional neural networks, etc.).
In other embodiments, the electronic device may further extract, from the sub-audio data, a feature for detecting whether a captured scene of the audio data is transformed. The features may include, but are not limited to: time domain features (e.g., volume, energy, etc.), frequency domain features (e.g., spectral centroid, spectral average energy, etc.), time-frequency features (e.g., short-time fourier transform coefficients, etc.), and the like. Whether the acquisition scene of the audio data is transformed or not is detected based on the extracted features according to a scene transformation detection algorithm.
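A minimal sketch of this feature extraction and change check follows, assuming librosa and NumPy; the feature set and the change threshold are assumptions of the sketch.

```python
import numpy as np
import librosa

def audio_features(frame: np.ndarray, sr: int) -> np.ndarray:
    """Time-domain, frequency-domain and time-frequency features for one sub-audio frame."""
    rms = float(np.mean(librosa.feature.rms(y=frame)))                       # volume / energy
    centroid = float(np.mean(librosa.feature.spectral_centroid(y=frame, sr=sr)))
    stft_mag = np.abs(librosa.stft(frame))
    mean_energy = float(np.mean(stft_mag ** 2))                              # spectral average energy
    return np.array([rms, centroid, mean_energy])

def audio_scene_changed(prev_feats, curr_feats, threshold=1.0):
    """Flag a change of acquisition scene when the normalized feature distance jumps."""
    return np.linalg.norm(curr_feats - prev_feats) / (np.linalg.norm(prev_feats) + 1e-6) > threshold
```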
S43, when the acquisition scene of the audio data is transformed, classifying the scene of the sub-audio data of which the acquisition scene is transformed, and obtaining the type of the audio scene.
The scene type obtained by classifying the sub-audio data is referred to as an audio scene type.
In some embodiments, the electronic device may extract relevant feature vectors from each of the sub-audio data using a machine learning algorithm (e.g., support vector machine, decision tree, random forest, etc.) for scene classification, resulting in an audio scene type. The audio scene types may include, but are not limited to: dialog classes, music classes, scene classes, etc.
Illustratively, the audio scene type is determined to be a conversational class assuming that the sub-audio data contains conversational sounds of people (e.g., phone calls, conference discussions, etc.). Conversational audio data typically has pronounced speech characteristics such as time-frequency characteristics of speech, speech speed, intonation, etc. Assuming that the sub-audio data includes audio representing music play, performance, singing, etc., the audio scene type is determined to be a music class. Musical audio data typically has unique spectral characteristics, tempo and instrument sounds. Assuming that the sub-audio data includes audio representing a background environment (e.g., city street noise, natural environment sounds, traffic sounds, etc.), the audio scene type is determined to be a scene class. Scene-type audio data typically contains features such as ambient noise, sound texture, etc.
It should be noted that scene classification, which includes scene recognition, is triggered only when a scene change occurs; if there is no scene change, it is not triggered. Scene classification of the sub-audio data yields the audio scene type and can be flexibly designed and adjusted according to the specific task or user requirements.
S44, carrying out audio layering on each piece of sub-audio data to obtain layered audio.
In some embodiments, the sub-audio data may be layered based on the distribution of the audio signal over time, or based on the scenes people commonly encounter in daily life. The audio may be divided into a front layer (dialog), a middle layer (music), and a back layer (background). Each piece of sub-audio data can be input into a pre-trained audio layering recognition model, which outputs the layering result for that sub-audio data, i.e., whether it belongs to the front-layer dialog, the middle-layer music, or the back-layer background.
Illustratively, assume that the sub-audio data is a scene in which a person is talking in a cafe. In this scenario, the front-layer-dialog may be a person's actual conversational sounds (e.g., language, spoken sounds, etc.); the middle layer-music may be background music played by a cafe; the back-background may be background noise and ambient sound (e.g., a person's footstep, a coffee machine's sound, an echo of the environment, etc.).
Through this alternative implementation, because the audio characteristics of different layers correspond to different scene elements, dividing the sub-audio data into front, middle, and back layers (or even more layers) and processing only the audio information of the designated layer improves processing efficiency and quality. For example, to extract the conversation content of people in a cafe, processing can focus on the front-layer dialog while reducing the influence of the middle-layer music and back-layer background.
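The sketch below shows how sub-audio frames might be grouped by layer; the layering model interface is hypothetical and stands in for the pre-trained audio layering recognition model described above.

```python
FRONT, MIDDLE, BACK = "front_dialog", "middle_music", "back_background"

def layer_audio(sub_frames, layering_model):
    """Group sub-audio frames by the layer a (hypothetical) layering model assigns to them."""
    layers = {FRONT: [], MIDDLE: [], BACK: []}
    for frame in sub_frames:
        label = layering_model.predict(frame)   # hypothetical model interface
        layers[label].append(frame)
    return layers

# Downstream processing can then focus on a designated layer only, e.g.:
# dialog_audio = layer_audio(frames, model)[FRONT]
```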
S45, carrying out audio recognition on the layered audio to obtain audio information.
The electronic device may input each layered audio into a trained audio recognition model (e.g., a deep-learning-based convolutional neural network (Convolutional Neural Network, CNN), recurrent neural network (Recurrent Neural Network, RNN), Transducer, etc.) for recognition to obtain the corresponding audio information. The audio information may include the content, features, and so on of the audio data.
For example, assuming that a certain hierarchical audio is music data, the electronic device may output "pop music", "rock music", "classical music", and the like as recognition results through the trained audio recognition model.
In some implementations, the electronic device may also obtain audio information of the layered audio, such as the title of the song, the name of the artist, etc., through the trained audio recognition model.
In an alternative embodiment, the electronic device stores the audio scene type and the corresponding audio information.
The electronic equipment can store the audio information in a classified mode according to the audio scene type. That is, audio information having the same audio scene type is stored in the same location, and audio information having different audio scene types is stored in different locations.
Storing the audio scene type together with the corresponding audio information realizes structured storage of the audio information.
S13, classifying and identifying the image information and the audio information to obtain a memory original text.
The image information obtained by processing the video data, the video scene type corresponding to each piece of image information, the audio information obtained by processing the audio data, and the audio scene type corresponding to each piece of audio information are stored in a memory organization module of the electronic device, so that the memory organization module classifies and recognizes the image information and the audio information to obtain the memory original text.
The memory organization module of the electronic device classifies and recognizes the image information and the audio information; obtaining the memory original text may include classifying and recognizing the image information to obtain an image text, classifying and recognizing the audio information to obtain an audio text, and semantically associating the image text with the audio text to obtain the memory original text.
In some embodiments, the electronic device may utilize natural language processing techniques, speech recognition techniques, etc. to extract image text from image information, and audio text from audio information, i.e., converting the image information and audio information into text descriptions, text labels, text keywords, etc. The memory original text may be descriptive text related to image information or audio information for recording key information, content summaries or identification information. For example, classifying and identifying the image information can obtain class labels such as 'people', 'scenes', 'objects', and the like; classification recognition of audio information may result in tags of the sound type, such as "speaking", "music", etc.
In some implementations, the image text and the audio text may be semantically associated based on scene or time or place or subject matter to structurally merge the image text and the audio text.
The electronic device semantically associates the image text and the audio text based on scene, time, place, or topic, that is, it matches or links the image text and the audio text according to the scene, time, place, or topic so as to establish a semantic connection between them. Semantic association yields merged data in which the image text and the audio text have been structurally combined, and this merged data serves as the memory original text. The merged data may be in a text format (e.g., XML, JSON, etc.).
For example, suppose the video data shows the indoor environment of a cafe. Visual features extracted from the image information may include: cafe, coffee, indoor environment, and so on. If a person's voice in the audio data mentions the name of the cafe, audio features extracted from the audio information may include the name of the cafe, the price of the coffee, and so on. Since both the visual and the audio features include "cafe", the image information and the audio information can be semantically associated.
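A minimal sketch of the structured merge into a JSON memory original text follows; it assumes each image-text and audio-text item carries scene and time fields as its association key, and these field names are assumptions of the sketch.

```python
import json
from collections import defaultdict

def merge_image_and_audio_text(image_texts, audio_texts):
    """Structurally merge image text and audio text that share a scene/time key."""
    merged = defaultdict(lambda: {"image_text": [], "audio_text": []})
    for item in image_texts:                 # item: {"scene": ..., "time": ..., "text": ...}
        merged[(item["scene"], item["time"])]["image_text"].append(item["text"])
    for item in audio_texts:
        merged[(item["scene"], item["time"])]["audio_text"].append(item["text"])
    # serialize the merged records as the JSON "memory original text"
    return json.dumps(
        [{"scene": s, "time": t, **v} for (s, t), v in merged.items()],
        ensure_ascii=False, indent=2)
```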
Through the optional implementation manner, the image information and the audio information can be related to a common semantic concept through semantic association, and the image information and the audio information are combined into a unified data storage form in a structured manner, so that the integration and sharing of multi-mode data are facilitated, the accuracy and the relevance of search results are improved, and richer and more accurate data description and analysis results are provided for users.
S14, calling a large language model to process the memory original text, and storing the processed memory abstract and the memory original text in a database.
Large language models refer to models that are trained through machine learning and artificial intelligence techniques for understanding and generating natural language text. The large language model can realize tasks such as semantic understanding, text generation, question and answer and the like by training on large-scale text data, and has strong language processing capability.
The electronic device can periodically call an application programming interface (Application Programming Interface, API) of the large language model (the period can be determined by the amount of data generated by scene changes), or directly call a locally stored large language model, to process the memory original text produced by the memory organization module and obtain a memory abstract.
The electronic equipment stores the memory abstract and the memory original text in a database at the same time, and associates the memory abstract and the memory original text for later retrieval.
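The following sketch summarizes the memory original text with a large language model and stores the summary and the original together; SQLite is assumed as the database, and call_llm is a placeholder, since the application does not specify a particular model or API.

```python
import sqlite3

def call_llm(prompt: str) -> str:
    """Placeholder for the large-language-model call (provider/API not specified here)."""
    raise NotImplementedError

def save_memory(original_text: str, db_path: str = "memory.db"):
    """Summarize the memory original text and store summary and original, linked, in a database."""
    summary = call_llm("Summarize the following record into a concise memory abstract:\n"
                       + original_text)
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS memories (
                        id INTEGER PRIMARY KEY AUTOINCREMENT,
                        summary TEXT,
                        original TEXT)""")
    conn.execute("INSERT INTO memories (summary, original) VALUES (?, ?)",
                 (summary, original_text))
    conn.commit()
    conn.close()
```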
And S15, inquiring in the database and outputting a memory abstract corresponding to the inquiry problem when the inquiry problem of the user is received.
The user can input a query question into the electronic device by voice or keyboard. The query question concerns something that happened in the past; that is, when the user's memory fails, a query can be made against the visual and auditory information previously collected from past scenes.
In an alternative embodiment, when the user inputs the query question into the electronic device via the keyboard, the query question is already in text form (a text query question), so the electronic device can directly query the database and output the memory abstract corresponding to the text query question. The electronic device may also output the memory original text associated with that memory abstract.
In an alternative embodiment, when a user inputs a query question in an electronic device by means of voice input, since the query question input by voice is a query question in voice form (voice query question), the electronic device needs to perform voice recognition on the query question in voice form to obtain a query question in text form, and then query in the database and output a memory abstract corresponding to the query question in text form.
In some embodiments, the user's voice query questions may be collected by the audio collection device or voice recording device, which are converted to text query questions using voice recognition techniques (e.g., automatic speech recognition (Automatic Speech Recognition, ASR) techniques). In other embodiments, the voice query may be pre-processed (e.g., noise removed, audio level reduced, voice signal enhancement, etc.) prior to voice recognition of the voice query to improve accuracy of subsequent voice recognition of the voice query. After the speech query question is speech-recognized, the text query question that was recognized may be post-processed (e.g., to remove recognition errors, punctuation processing, etc.) to obtain a more accurate text query question.
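A minimal sketch of handling a voice query follows, reusing the SQLite schema from the earlier sketch; transcribe is a placeholder for the ASR engine, and the keyword match is an illustrative stand-in for the actual retrieval mechanism.

```python
import sqlite3

def transcribe(audio_path: str) -> str:
    """Placeholder for an ASR engine converting the voice query into a text query."""
    raise NotImplementedError

def answer_query(audio_path: str, db_path: str = "memory.db") -> list[str]:
    """Recognize the voice query, then return matching memory abstracts from the database."""
    text_query = transcribe(audio_path)
    conn = sqlite3.connect(db_path)
    # naive keyword match; a real system would use the LLM or semantic retrieval instead
    rows = conn.execute(
        "SELECT summary FROM memories WHERE original LIKE ? OR summary LIKE ?",
        (f"%{text_query}%", f"%{text_query}%")).fetchall()
    conn.close()
    return [r[0] for r in rows]
```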
The following are examples of several application scenarios of the artificial intelligence based memory preservation and memory retrieval method of the present application.
In the first application scenario, user U1 chats with friend A today and learns that Apple will hold a new-product launch event at 1 a.m. on June 6, but forgets this after hearing it. In the evening, user U1 inputs the query question "what was mentioned in today's chat", and the question-and-answer result is output in voice or text form as "information in today's chat: Apple launch event, 1 a.m. on June 6".
In the second application scenario, user U2 held a video conference with client B this week; a slide presented by client B in the conference showed the market capacity for the coming year, which user U2 has since forgotten. User U2 inputs the query question "what market capacity did the client mention last week", and the question-and-answer result is output in voice or text form as "market capacity: XXX".
In the third application scenario, user U3 went to the XX hotpot restaurant with family member C last week and has forgotten the restaurant's location. User U3 inputs the query question "where is the XX hotpot restaurant I went to last week", and the question-and-answer result is output in voice or text form as "the XX hotpot restaurant you visited last week is at No. XX, XX Street". The electronic device can display, in text form, the location of the XX hotpot restaurant, its telephone number, a planned route, its set menus and the like, and can display pictures related to the restaurant (such as a photo of the storefront) in image form.
It should be noted that the artificial-intelligence-based memory preservation and memory extraction method of the application can be executed entirely by the electronic device, or jointly by the electronic device and the cloud: the electronic device collects the video data and audio data of the user's scene in real time, processes the video data into image information and the audio data into audio information, a large language model running in the cloud processes and understands the image information and audio information and extracts the memory abstract, and the electronic device saves the cloud-extracted memory abstract for querying, thereby providing memory reminders.
In addition, the application may also collect only the audio data of the scene where the user is located and realize memory preservation and memory extraction based on the audio data alone.
In the method, video data and audio data of the scene where the user is located are collected; the video data and the audio data are processed to obtain image information and audio information respectively; the image information and the audio information are classified and recognized to obtain a memory original text; a large language model is called to process the memory original text; and the resulting memory abstract and the memory original text are stored in a database. When a query question from the user about a past event is received, the database is queried and the memory abstract corresponding to the query question is output. By recording and saving video data and/or audio data of the user's scene anytime and anywhere, the application effectively gives the user an external memory, fills gaps in the user's memory, reduces the likelihood of memory omissions and memory errors, lightens the user's mental burden, and improves working efficiency and quality of life. In addition, a question-and-answer result is output based on the query question input by the user, forming a closed loop of memory preservation and memory extraction, which is convenient for the user and improves the efficiency of the user's work and daily life.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application. In a preferred embodiment of the present application, the electronic device 6 may include a memory 61, at least one processor 62, and at least one communication bus 63.
In this embodiment, the electronic device 6, comprising the memory 61, the at least one processor 62 and the at least one communication bus 63, may obtain the video data of the scene where the user is located from another device, for example an image capturing device, and obtain the audio data of that scene from an audio capturing device.
In some embodiments, the memory 61 has stored therein a computer program which, when executed by the at least one processor 62, performs all or part of the steps in the artificial intelligence based memory preservation and memory retrieval method as described.
In some embodiments, the at least one processor 62 is a Control Unit (Control Unit) of the electronic device 6, connects the various components of the entire electronic device 6 using various interfaces and lines, and performs various functions of the electronic device 6 and processes data by running or executing programs or modules stored in the memory 61, and invoking data stored in the memory 61. For example, the at least one processor 62, when executing the computer program stored in the memory 61, implements all or part of the steps of the artificial intelligence based memory preservation and memory retrieval method described in embodiments of the present application.
The at least one communication bus 63 is arranged to enable a connection communication between the memory 61 and the at least one processor 62 etc.
Referring to fig. 7, a schematic structural diagram of another electronic device according to an embodiment of the present application is shown. The electronic device 7 may include a memory 71, at least one processor 72, a camera 73, a microphone 74, a speaker 75, a screen 76, and at least one communication bus 77, among others.
The electronic device 7 may include a user-side device, where the user-side device includes, but is not limited to, any electronic product that can interact with a user by using a keyboard, a mouse, a remote controller, a touch pad, or a voice control device, for example, a personal computer, a tablet computer, a smart phone, a digital camera, and so on.
In some embodiments, the memory 71 has stored therein a computer program which, when executed by the at least one processor 72, performs all or part of the steps in the artificial intelligence based memory preservation and memory retrieval method as described.
In some embodiments, the at least one processor 72 is a Control Unit (Control Unit) of the electronic device 7, connects the various components of the entire electronic device 7 using various interfaces and lines, and performs various functions of the electronic device 7 and processes data by running or executing programs or modules stored in the memory 71, and invoking data stored in the memory 71. For example, the at least one processor 72, when executing the computer programs stored in the memory 71, implements all or part of the steps of the artificial intelligence based memory preservation and memory retrieval method described in embodiments of the present application.
The camera 73 is configured to collect video data of a scene in which a user is located and transmit the video data to the at least one processor 72.
The microphone 74 is configured to collect audio data of a scene in which the user is located and transmit the audio data to the at least one processor 72.
The speaker 75 is used for playing various voice information output by the electronic device 7.
The screen 76 may display various kinds of information output from the electronic device 7, and may also be used to receive a touch operation by a user or the like.
The at least one communication bus 77 is arranged to enable connection communication between the memory 71, the at least one processor 72, the camera 73, the microphone 74, the loudspeaker 75, the screen 76, etc.
It should be understood by those skilled in the art that the structures shown in FIG. 6 and FIG. 7 do not limit the embodiments of the present application; the electronic device 6 and the electronic device 7 may adopt a bus or star topology, may include more or fewer components than illustrated, and may combine certain components or arrange them differently in hardware or software.
It should be noted that the electronic device 6 and the electronic device 7 are only examples; other existing or future electronic products that can be adapted to the present application are also intended to fall within the scope of protection of the present application and are incorporated herein by reference.
Although not shown, the electronic device 6 and the electronic device 7 may further comprise a power source (such as a battery) for supplying power to the respective components, and preferably, the power source may be logically connected to the processor through a power management device, for example, the at least one processor 62 or the at least one processor 72, so as to perform functions of managing charging, discharging, and power consumption management through the power management device. The power supply may also include one or more of any of a direct current or alternating current power supply, recharging device, power failure detection circuit, power converter or inverter, power status indicator, etc. The electronic device 6 and the electronic device 7 may further include various sensors, bluetooth modules, wi-Fi modules, etc., which are not described herein.
The integrated units implemented in the form of software functional modules described above may be stored in a computer readable storage medium. The software functional modules described above are stored in a storage medium and include instructions for causing an electronic device (which may be a personal computer, an electronic device, or a network device, etc.) or a processor (processor) to perform portions of the methods described in the various embodiments of the application.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be other manners of division when actually implemented.
The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

Claims (10)

1. A memory preservation and memory extraction method based on artificial intelligence, the method comprising:
collecting video data and audio data of a scene where a user is located;
processing the video data to obtain image information, and processing the audio data to obtain audio information;
classifying and identifying the image information and the audio information to obtain a memory original text;
calling a large language model to process the memory original text, and storing the processed memory abstract together with the memory original text in a database;
and when a query question of the user is received, querying the database and outputting a memory abstract corresponding to the query question.
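By way of illustration only, the preservation and query steps of claim 1 may be sketched as follows. The function names, the use of sqlite3 as the database, and the keyword-based lookup are assumptions made for this example and are not recited in the claim; `summarize` stands in for any callable that wraps a large language model, and the classification and recognition steps that produce the memory original text are omitted.

```python
# Minimal, non-limiting sketch of the claim 1 pipeline.
import sqlite3

def build_memory_store(path=":memory:"):
    # Hypothetical schema: one row per memory, original text plus LLM abstract.
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS memories "
               "(id INTEGER PRIMARY KEY, original TEXT, abstract TEXT)")
    return db

def preserve_memory(db, memory_original, summarize):
    # `summarize` stands in for the large language model call in the claim.
    abstract = summarize(memory_original)
    db.execute("INSERT INTO memories (original, abstract) VALUES (?, ?)",
               (memory_original, abstract))
    db.commit()
    return abstract

def query_memory(db, query_question):
    # Naive keyword match; a real system could instead rank by semantic similarity.
    row = db.execute("SELECT abstract FROM memories WHERE original LIKE ?",
                     ("%" + query_question + "%",)).fetchone()
    return row[0] if row else None
```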
2. The artificial intelligence based memory preservation and memory extraction method according to claim 1, wherein the processing the video data to obtain image information comprises:
performing dynamic frame acquisition on the video data by combining a scene change detection algorithm and a rate prediction algorithm to obtain a plurality of image data;
performing content segmentation on each image data to obtain image data blocks;
and carrying out image recognition on the image data block to obtain the image information.
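As a purely illustrative reading of the content segmentation step of claim 2, an image frame may be cut into fixed-size blocks before recognition; the block size and the NumPy array representation are assumptions of this sketch, not features of the claim.

```python
# Illustrative content segmentation: cut one image frame into fixed-size blocks.
import numpy as np

def segment_image(image, block_size=224):
    """Split an H x W x C array into a list of at-most block_size x block_size tiles."""
    blocks = []
    height, width = image.shape[:2]
    for y in range(0, height, block_size):
        for x in range(0, width, block_size):
            blocks.append(image[y:y + block_size, x:x + block_size])
    return blocks
```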
3. The artificial intelligence based memory preservation and memory extraction method according to claim 2, wherein the performing dynamic frame acquisition on the video data by combining the scene change detection algorithm and the rate prediction algorithm to obtain a plurality of image data comprises:
performing scene detection on the video data by using a scene change detection algorithm to obtain a video scene type;
performing adaptive change-rate prediction on the video data corresponding to each video scene type by using the rate prediction algorithm;
when the predicted change rate is higher than a preset rate threshold, acquiring frames of the video data at a first preset frame rate to obtain a plurality of image data corresponding to the video scene type;
when the predicted change rate is lower than the preset rate threshold, acquiring frames of the video data at a second preset frame rate to obtain a plurality of image data corresponding to the video scene type;
wherein the first preset frame rate is greater than the second preset frame rate.
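The frame-rate selection rule of claim 3 can be expressed compactly as below; the concrete threshold and the two preset frame rates are example values only, since the claim merely requires that the first preset frame rate be greater than the second.

```python
# Adaptive frame-rate selection following the rule of claim 3 (example values only).
def select_frame_rate(predicted_change_rate,
                      rate_threshold=0.5,
                      first_preset_fps=10.0,    # used for fast-changing scenes
                      second_preset_fps=1.0):   # used for slow-changing scenes
    assert first_preset_fps > second_preset_fps
    if predicted_change_rate > rate_threshold:
        return first_preset_fps
    return second_preset_fps
```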
4. The artificial intelligence based memory preservation and memory extraction method according to any one of claims 1 to 3, wherein the processing the audio data to obtain audio information comprises:
acquiring the audio data in frames to obtain a plurality of sub-audio data;
detecting, according to a scene change detection algorithm, whether the acquisition scene of the audio data has changed;
when the acquisition scene of the audio data has changed, performing scene classification on the sub-audio data whose acquisition scene has changed to obtain an audio scene type;
performing audio layering on each sub-audio data to obtain layered audio;
and performing audio recognition on the layered audio to obtain the audio information.
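For claim 4, the framing step is shown literally below, while "audio layering" is interpreted, solely for illustration, as splitting each sub-audio frame into a low-frequency layer and a residual high-frequency layer with a moving-average filter; the frame length and window length are assumptions of the example.

```python
# Illustrative audio framing and layering for claim 4.
import numpy as np

def frame_audio(samples, frame_len):
    """Split a 1-D sample sequence into consecutive sub-audio frames."""
    count = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(count)]

def layer_audio(frame, window=32):
    """Return (low_layer, high_layer) via a simple moving-average split."""
    frame = np.asarray(frame, dtype=float)
    kernel = np.ones(window) / window
    low = np.convolve(frame, kernel, mode="same")
    return low, frame - low
```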
5. The artificial intelligence based memory preservation and memory extraction method according to claim 4, wherein the classifying and identifying the image information and the audio information to obtain the memory original text comprises:
classifying and identifying the image information to obtain an image text, classifying and identifying the audio information to obtain an audio text, and semantically associating the image text and the audio text to obtain the memory original text.
6. The artificial intelligence based memory preservation and memory extraction method according to claim 5, wherein the semantically associating the image text and the audio text comprises:
semantically associating the image text and the audio text based on scene, time, place, or subject, so as to structurally merge the image text and the audio text.
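One possible structural merge for claims 5 and 6 is sketched below: image text and audio text that share the same scene and lie within a time window are combined into a single memory-original record. The field names and the 60-second window are assumptions of the example, not limitations of the claims.

```python
# Illustrative scene- and time-based merge of image text and audio text.
def merge_texts(image_items, audio_items, time_window=60):
    """Each item is a dict with 'scene', 'timestamp' (seconds), and 'text' keys."""
    merged = []
    for img in image_items:
        for aud in audio_items:
            same_scene = img["scene"] == aud["scene"]
            close_in_time = abs(img["timestamp"] - aud["timestamp"]) <= time_window
            if same_scene and close_in_time:
                merged.append({
                    "scene": img["scene"],
                    "timestamp": min(img["timestamp"], aud["timestamp"]),
                    "image_text": img["text"],
                    "audio_text": aud["text"],
                })
    return merged
```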
7. The artificial intelligence based memory preservation and memory extraction method according to claim 6, further comprising:
classifying, compressing, and storing the corresponding image information according to the video scene type; and
storing the audio scene type and the corresponding audio information.
8. The artificial intelligence based memory preservation and memory extraction method according to claim 7, wherein when the query question is a voice query question input by the user in voice form, the querying the database and outputting a memory abstract corresponding to the query question comprises:
performing voice recognition on the voice query problem to obtain a text query problem;
and querying the database and outputting a memory abstract corresponding to the text query question.
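The voice-query path of claim 8 reduces to a short wrapper; in this sketch `recognize_speech` stands in for any speech-to-text engine and `lookup` for the database query of claim 1, both passed in as callables because the claim does not fix particular implementations.

```python
# Illustrative voice-query handling for claim 8.
def answer_voice_query(voice_query, recognize_speech, lookup):
    text_query = recognize_speech(voice_query)  # voice query -> text query
    return lookup(text_query)                   # query the database, return the memory abstract
```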
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the artificial intelligence based memory preservation and memory extraction method of any one of claims 1 to 8 when executing the computer program.
10. A computer readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the artificial intelligence based memory preservation and memory extraction method of any one of claims 1 to 8.
CN202311051700.9A 2023-08-18 2023-08-18 Memory preservation and memory extraction method based on artificial intelligence and related equipment Pending CN117033556A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311051700.9A CN117033556A (en) 2023-08-18 2023-08-18 Memory preservation and memory extraction method based on artificial intelligence and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311051700.9A CN117033556A (en) 2023-08-18 2023-08-18 Memory preservation and memory extraction method based on artificial intelligence and related equipment

Publications (1)

Publication Number Publication Date
CN117033556A true CN117033556A (en) 2023-11-10

Family

ID=88624238

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311051700.9A Pending CN117033556A (en) 2023-08-18 2023-08-18 Memory preservation and memory extraction method based on artificial intelligence and related equipment

Country Status (1)

Country Link
CN (1) CN117033556A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118070908A (en) * 2024-04-22 2024-05-24 南京凯奥思数据技术有限公司 Large model question-answering method, system and storage medium based on history dialogue record optimization

Similar Documents

Publication Publication Date Title
CN110519636B (en) Voice information playing method and device, computer equipment and storage medium
CN116484318B (en) Lecture training feedback method, lecture training feedback device and storage medium
CN110265040A (en) Training method, device, storage medium and the electronic equipment of sound-groove model
CN110189754A (en) Voice interactive method, device, electronic equipment and storage medium
US11355099B2 (en) Word extraction device, related conference extraction system, and word extraction method
WO2005071665A1 (en) Method and system for determining the topic of a conversation and obtaining and presenting related content
GB2523635A (en) Audio and video synchronizing
CN113392687A (en) Video title generation method and device, computer equipment and storage medium
CN117033556A (en) Memory preservation and memory extraction method based on artificial intelligence and related equipment
CN115798459B (en) Audio processing method and device, storage medium and electronic equipment
CN110136726A (en) A kind of estimation method, device, system and the storage medium of voice gender
WO2024140430A1 (en) Text classification method based on multimodal deep learning, device, and storage medium
CN114138960A (en) User intention identification method, device, equipment and medium
CN110111795B (en) Voice processing method and terminal equipment
CN113903338A (en) Surface labeling method and device, electronic equipment and storage medium
CN113573128A (en) Audio processing method, device, terminal and storage medium
WO2022180860A1 (en) Video session evaluation terminal, video session evaluation system, and video session evaluation program
CN113808577A (en) Intelligent extraction method and device of voice abstract, electronic equipment and storage medium
CN113538645A (en) Method and device for matching body movement and language factor of virtual image
JP6838739B2 (en) Recent memory support device
CN114666307B (en) Conference interaction method, conference interaction device, equipment and storage medium
US11889168B1 (en) Systems and methods for generating a video summary of a virtual event
CN116721662B (en) Audio processing method and device, storage medium and electronic equipment
US20240194200A1 (en) System and method for change point detection in multi-media multi-person interactions
CN113407765B (en) Video classification method, apparatus, electronic device, and computer-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination