CN118939831A - Natural language interaction retrieval intelligent security system based on large model

Info

Publication number: CN118939831A
Authority: CN (China)
Prior art keywords: information, analysis, query, video data, platform
Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number: CN202411421116.2A
Other languages: Chinese (zh)
Inventors: 杨恒, 吴永杰, 龙涛, 闫禹杭
Current Assignee: Shenzhen Aimo Technology Co., Ltd. (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: Shenzhen Aimo Technology Co., Ltd.
Filing date: 2024-10-12
Publication date: 2024-11-12
Application filed by Shenzhen Aimo Technology Co., Ltd.

Abstract

The application discloses a natural language interaction retrieval intelligent security system based on a large model, which comprises an analysis platform and an interaction platform. The analysis platform analyzes and processes video data to obtain analysis information corresponding to the video data. The interaction platform is built automatically on the basis of robotic process automation; it is configured to convert a user's query information into a query instruction by utilizing the large model, and to obtain the user's query content, target time, and target place based on the query instruction. The interaction platform acquires, from the analysis platform, analysis information matching the target time and target place, performs a similarity calculation between the query content and the matched analysis information, and outputs a query result. The application can significantly improve the speed and accuracy of monitoring-data retrieval, thereby meeting the requirements of large-scale video data processing; at the same time, it can greatly simplify the user's operation flow, improve operational convenience, and give the system greater user-friendliness and intelligence.

Description

Natural language interaction retrieval intelligent security system based on large model
Technical Field
The application relates to the technical field of intelligent security, in particular to a natural language interaction retrieval intelligent security system based on a large model.
Background
With the wide application of video monitoring technology, conventional security systems face many challenges in data retrieval and processing. Existing systems typically rely on manual operations and basic search algorithms, resulting in inefficiency and slow response when processing large amounts of video data, with particularly significant shortcomings in real-time monitoring-data analysis and user interaction. These problems create a strong need for a more efficient, intelligent solution to improve the performance of security systems.
Disclosure of Invention
The application aims to provide a natural language interactive retrieval intelligent security system based on a large model, which solves the technical problem that existing security systems process video data inefficiently and therefore offer insufficient interactivity. Preferred versions of the technical solutions provided by the present application can produce the technical effects described below.
In order to achieve the above purpose, the present application provides the following technical solutions:
The application provides a natural language interactive retrieval intelligent security system based on a large model, which comprises an analysis platform and an interaction platform. The analysis platform analyzes and processes video data to obtain analysis information corresponding to the video data, wherein the video data includes a plurality of image frames and the analysis information includes general information and specific information describing the plurality of image frames. The interaction platform is built automatically on the basis of robotic process automation; it is configured to convert a user's query information into a query instruction by utilizing the large model, and to obtain the user's query content, target time, and target place based on the query instruction. The interaction platform acquires, from the analysis platform, analysis information matching the target time and the target place, performs a similarity calculation between the query content and the matched analysis information, and outputs a query result.
In some embodiments, the analysis platform comprises an acquisition module, a detection module, and an analysis module; the acquisition module is configured to acquire video data containing a plurality of objects; the detection module is configured to detect the video data and acquire a key frame corresponding to each object; the analysis module is configured to analyze each key frame and acquire the general information and the specific information corresponding to the key frame.
In some embodiments, the detection module is configured to acquire objects in each image frame within the video data and generate a confidence probability and bounding box for each object.
In some embodiments, the detection module is configured to obtain, based on a motion state of each object, a keyframe corresponding to each object in combination with time-series data, where the keyframe is an image frame with the highest confidence probability.
In some embodiments, the analysis module is configured to perform overall picture analysis on each of the key frames, and obtain a general description as the general information; the analysis module is configured to perform independent analysis on each bounding box and acquire object descriptions as the specific information.
In some embodiments, the analysis module is configured to vectorize the general information and the specific information, respectively, to obtain general vector information and specific vector information.
In some embodiments, the analysis module performs analysis processing on the video data based on a plurality of model prompt words of different categories to obtain the analysis information.
In some embodiments, the interaction platform is configured to vectorize the query content, use the large model to perform semantic similarity calculation between the vectorized query content and the general vector information and specific vector information corresponding to the key frames, and screen out at least one key frame as the query result.
In some embodiments, the query information of the user is voice information; the interactive platform is further configured to convert the voice information into text information using the large model and obtain the query instruction based on the text information.
In some embodiments, a camera module for providing the video data is further included; the interaction platform is configured to screen a camera operation command from the query instruction, and acquire the video data from the camera module based on the camera operation command as the query result or send the video data to the analysis platform.
By implementing one of the above technical solutions, the application has the following advantages or beneficial effects: by constructing the analysis platform to analyze and process the video data and obtain the general information and the specific information describing its image frames, the application can significantly improve the speed and accuracy of monitoring-data retrieval and thereby meet the requirements of large-scale video data processing; meanwhile, by constructing the interaction platform, the application converts the user's query information into a query instruction and outputs a query result through similarity calculation against the analysis information, improving the system's operating efficiency and user experience; in addition, the interaction platform is built automatically on the basis of robotic process automation, which greatly simplifies the user's operation flow, improves operational convenience, and gives the system greater user-friendliness and intelligence.
Drawings
For a clearer description of the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained from them by a person skilled in the art without inventive effort. In the drawings:
FIG. 1 is a block diagram of a large model-based natural language interactive retrieval intelligent security system in accordance with an embodiment of the present application;
FIG. 2 is a schematic flow diagram of matching video data and query information of a large model-based natural language interactive retrieval intelligent security system according to an embodiment of the application;
FIG. 3 is a flow diagram of query instruction processing according to an embodiment of the present application;
fig. 4 is a block diagram of the processing apparatus according to the embodiment of the present application.
In the figure: 1. natural language interaction retrieval intelligent security system based on large model; 10. an analysis platform; 11. an acquisition module; 12. a detection module; 13. an analysis module; 20. an interaction platform; 30. a camera module; 4. a processing device; 40. a memory; 41. a processor.
Detailed Description
For a better understanding of the objects, technical solutions, and advantages of the present application, reference should be made to the various exemplary embodiments described hereinafter with reference to the accompanying drawings, which form a part hereof. The same reference numbers in different drawings identify the same or similar elements unless expressly stated otherwise. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure; they are merely examples of processes, methods, and apparatuses consistent with certain aspects of the disclosure as detailed in the appended claims. Other embodiments may be utilized, and structural and functional modifications may be made to the embodiments set forth herein, without departing from the scope and spirit of the present disclosure.
In the description of the present application, it should be understood that terms such as "center," "longitudinal," and "transverse" indicate orientations or positional relationships based on those shown in the drawings; they are used merely for convenience and simplicity of description, and do not indicate or imply that the elements referred to must have a particular orientation or be constructed and operated in a particular orientation. The terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. The term "plurality" means two or more. The terms "connected" and "coupled" are to be construed broadly: a connection may, for example, be fixed, detachable, or integral; mechanical, electrical, or communicative; direct, or indirect via intermediaries; and it may also denote internal communication or an interaction relationship between two elements. The term "and/or" includes any and all combinations of one or more of the associated listed items. The specific meanings of the above terms in the present application can be understood by those of ordinary skill in the art according to the specific circumstances.
In order to illustrate the technical solutions of the present application, the following description is made by specific embodiments, only the portions related to the embodiments of the present application are shown.
The application provides a natural language interactive retrieval intelligent security system 1 based on a large model. The natural language interaction retrieval intelligent security system 1 based on the large model can acquire the query information of the user and output the corresponding query result. The large-model-based natural language interaction retrieval intelligent security system 1 can output corresponding query results based on video data. The large-model-based natural language interaction retrieval intelligent security system 1 can acquire query information of a user, compare the query information of the user with analysis information of video data, and acquire and output corresponding query results.
In some embodiments, as shown in fig. 1, a large model-based natural language interactive retrieval intelligent security system 1 may include an analysis platform 10 and an interaction platform 20. The analysis platform 10 may be used to obtain analysis information of the video data; the interaction platform 20 may be configured to obtain query information of a user and output a corresponding query result based on the analysis information.
In some embodiments, the analysis platform 10 may perform analysis processing on the video data to obtain analysis information corresponding to the video data, wherein the video data may include a plurality of image frames, and the analysis information may include general information and specific information describing the plurality of image frames.
In some embodiments, analysis platform 10 may include an acquisition module 11, a detection module 12, and an analysis module 13. In some embodiments, the acquisition module 11 may be configured to acquire video data containing a plurality of objects; the detection module 12 may be configured to detect the video data, and obtain a key frame corresponding to each object; in some embodiments, the analysis module 13 may be configured to analyze each key frame to obtain general information and specific information corresponding to the key frame.
In some embodiments, the video data may include real-time video data and cached video data. The acquisition module 11 may acquire real-time video data from a camera or a monitoring device. The cached video data may refer to video data cached in a local device or a network device, or may be video data uploaded by a user and retrieved.
In some embodiments, the video data may include a plurality of image frames, and an image frame may include multiple objects. An object may be an article, a person, a vehicle, or the like. For example, for the real-time video data of a security camera, the plurality of objects may be the persons, articles, and so on that appear in the real-time video data.
In some embodiments, the detection module 12 may be configured to acquire objects in each image frame within the video data and generate a confidence probability and bounding box for each object.
In some embodiments, the detection module 12 may process each image frame in turn using an object detection algorithm, detecting each object in the picture and generating a confidence probability and bounding box (BBOX) for each object. The object detection algorithm may include the following steps: feature extraction, for example with a CNN (convolutional neural network); candidate-region generation, for example based on object presence via a region proposal network (RPN); classification of the candidate regions; and acquisition of the bounding boxes.
In some embodiments, the object detection algorithm may be one of the YOLO algorithm, the SSD (Single Shot MultiBox Detector) algorithm, the Faster R-CNN (faster region-based convolutional neural network) algorithm, or the CenterNet (center-point network) algorithm.
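As a concrete illustration, a minimal sketch of the per-frame detection step is given below, using the `ultralytics` YOLO package. This is only an assumption for illustration: the patent names YOLO as one of several candidate detectors and does not prescribe an implementation, and the checkpoint name and helper function are hypothetical.

```python
# Minimal sketch of the detection step, assuming the `ultralytics` package.
# Each frame yields, per detected object, a class name, a confidence
# probability, and a bounding box (BBOX) in pixel coordinates.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # any pretrained detection checkpoint (assumption)

def detect_objects(frame):
    """Return [(class_name, confidence, (x1, y1, x2, y2)), ...] for one frame."""
    result = model(frame, verbose=False)[0]
    detections = []
    for box in result.boxes:
        cls_name = result.names[int(box.cls)]
        conf = float(box.conf)                     # confidence probability
        x1, y1, x2, y2 = map(float, box.xyxy[0])   # bounding-box corners
        detections.append((cls_name, conf, (x1, y1, x2, y2)))
    return detections
```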
In some embodiments, the detection module 12 may also utilize a deep learning model for real-time object detection and tracking, obtaining confidence probabilities and bounding boxes for each object.
In some embodiments, the detection module 12 may be configured to obtain, based on the motion state of each object, a keyframe corresponding to each object in combination with the time-series data, where the keyframe may be an image frame with the highest confidence probability.
In some embodiments, the detection module 12 may predict the motion state of an object using Kalman filtering. The Kalman filter can be obtained by: defining state variables, constructing a motion equation based on the state variables, initializing the state, and predicting the motion state of the object based on the motion equation.
In some embodiments, the detection module 12 may predict the motion state of the object using an extended kalman filter or an unscented kalman filter.
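The following is a minimal sketch of a constant-velocity Kalman filter following the steps above (state definition, motion equation, initialization, prediction and update). The state layout, frame rate, and noise values are illustrative assumptions, not taken from the patent.

```python
# Constant-velocity Kalman filter over an object's bbox center, following
# the steps above. All matrix values are illustrative assumptions.
import numpy as np

dt = 1.0 / 25.0                          # frame interval, assuming 25 fps video
F = np.array([[1, 0, dt, 0],             # motion equation: position += velocity*dt
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)
H = np.array([[1, 0, 0, 0],              # we observe position (cx, cy) only
              [0, 1, 0, 0]], dtype=float)

x = np.zeros(4)                          # state variables: [cx, cy, vx, vy]
P = np.eye(4) * 10.0                     # state covariance (initialization)
Q = np.eye(4) * 0.01                     # process noise
R = np.eye(2) * 1.0                      # measurement noise

def predict():
    """Predict the next motion state; returns the predicted object center."""
    global x, P
    x = F @ x
    P = F @ P @ F.T + Q
    return x[:2]

def update(z):
    """Correct the state with a measured bbox center z = [cx, cy]."""
    global x, P
    y = z - H @ x                        # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)       # Kalman gain
    x = x + K @ y
    P = (np.eye(4) - K @ H) @ P
```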
In some embodiments, each object may correspond to at least one key frame, i.e., at least one image frame may be selected from the video data as a key frame. The key frame may represent the best detection instant of the object in the video data.
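Given tracked detections, the keyframe rule stated above (the image frame with the highest confidence probability per object) can be sketched as follows; the input format and names are hypothetical.

```python
# Hypothetical keyframe selection: for each tracked object, keep the frame
# in which the detector's confidence for that object peaked.
def select_keyframes(tracked_detections):
    """tracked_detections: iterable of (frame_idx, track_id, confidence)."""
    best = {}  # track_id -> (confidence, frame_idx)
    for frame_idx, track_id, conf in tracked_detections:
        if track_id not in best or conf > best[track_id][0]:
            best[track_id] = (conf, frame_idx)
    return {tid: idx for tid, (_, idx) in best.items()}
```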
In some embodiments, the analysis module 13 may be configured to analyze each key frame to obtain analysis information, where the analysis information may include general information and specific information.
In some embodiments, the analysis module 13 may analyze each key frame using a vision-language model (VLM) to obtain the analysis information. A vision-language model is a deep learning model that combines computer vision and natural language processing techniques in order to understand and generate associations between images and text. The vision-language model may include at least one of a VGG (VGG convolutional neural network) + RNN (recurrent neural network) model and a Faster R-CNN (faster region-based convolutional neural network) + LSTM (long short-term memory network) model.
In some embodiments, the analysis module 13 may be configured to perform overall picture analysis on each key frame and obtain a general description as the general information. The general description may be text information including at least the whole-picture information of the key frame. For example, the general description may be: "This picture is an indoor scene captured by a monitoring camera, showing two people. One woman is entering from the right side of the picture, and another woman stands behind the counter on the left. The interior decoration is simple, with white walls and floor. The counter on the left holds some items, including beverages and office supplies." It should be noted that the foregoing example is merely illustrative of the general description and is not intended to limit the embodiments of the application.
In some embodiments, the analysis module 13 may be configured to analyze each bounding box independently, obtaining the object description as specific information. The analysis module 13 may perform independent analysis on the bounding box in each key frame to obtain a plurality of specific information. Independent analysis may refer to the one-to-one correspondence of the acquired object descriptions to bounding boxes. For example, the bounding box A1 in the key frame a and the bounding box B1 in the key frame B may be independently analyzed, and the object description corresponding to the bounding box A1 and the object description corresponding to the bounding box B1 may be acquired as specific information.
In some embodiments, the analysis module 13 may perform independent analysis on all or part of the bounding boxes in the keyframes to obtain a plurality of specific information. In some embodiments, a key frame may include a plurality of specific information.
In some embodiments, the object description may be text information including at least the object information in a key frame. For example, the object description may be: "This is a blurred surveillance-video shot showing a person from the side. The person appears to wear a light-colored coat and dark pants, with hair tied in a ponytail." It should be noted that the foregoing example is merely illustrative and is not to be construed as limiting the embodiments of the present application.
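A sketch of the two-level analysis is shown below: one whole-frame general description plus one object description per bounding box. `vlm_describe` stands in for any vision-language model call and is a hypothetical interface, as are the prompt texts.

```python
# Two-level key-frame analysis: a general description of the whole picture
# plus an independent object description for each bounding box.
from PIL import Image

def analyze_keyframe(frame: Image.Image, bboxes, vlm_describe):
    general_info = vlm_describe(frame, prompt="Describe the overall scene.")
    specific_info = []
    for (x1, y1, x2, y2) in bboxes:
        crop = frame.crop((int(x1), int(y1), int(x2), int(y2)))  # per-box analysis
        specific_info.append(vlm_describe(crop, prompt="Describe this object."))
    return general_info, specific_info
```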
In some embodiments, the analysis module 13 may be configured to vectorize the general information and the specific information, respectively, to obtain general vector information and specific vector information.
In some embodiments, the analysis module 13 may utilize an embedding algorithm to vectorize the general information and the specific information separately. An embedding algorithm maps data in a high-dimensional space to a low-dimensional space while preserving the information and structure of the original data. That is, the general vector information may be a vectorized representation of the general information, and the specific vector information a vectorized representation of the specific information.
In some embodiments, the embedding algorithm may be at least one of the Word2Vec (word vector) algorithm, the GloVe (global vectors for word representation) algorithm, and the Node2Vec (node vector) algorithm.
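As an illustration, the sketch below vectorizes descriptions with gensim's Word2Vec (one of the algorithms named above), approximating a description vector as the mean of its word vectors. The corpus, dimensionality, and averaging scheme are assumptions.

```python
# Word2Vec-based vectorization of general/specific information; a text's
# vector is approximated by averaging its word vectors (an assumption).
import numpy as np
from gensim.models import Word2Vec

# Placeholder corpus; in the system this would be the general and object
# descriptions produced by the analysis module.
all_descriptions = [
    "a woman in a light coat enters from the right",
    "a woman stands behind the counter on the left",
]
corpus = [desc.lower().split() for desc in all_descriptions]
w2v = Word2Vec(corpus, vector_size=128, window=5, min_count=1, workers=4)

def embed(text: str) -> np.ndarray:
    tokens = [t for t in text.lower().split() if t in w2v.wv]
    if not tokens:
        return np.zeros(w2v.vector_size)
    return np.mean([w2v.wv[t] for t in tokens], axis=0)
```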
In some embodiments, the large model-based natural language interaction retrieval intelligent security system 1 may also include a storage module. The storage module may be configured to store general vector information, specific vector information, time information, and location information for each key frame.
In some embodiments, the time information may be obtained based on time series data. In some embodiments, as shown in fig. 2, general vector information, specific vector information, time information, and location information of one key frame may be stored as a set of data information of video data. That is, the storage module may store multiple sets of data information of video data. In some embodiments, data information associated with a key frame may be obtained based on the key frame, or a corresponding key frame may be obtained from a storage module based on the data information, e.g., based on time information and/or location information.
In some embodiments, the storage module may be a database, for example one of an SQLite, HSQLDB, SQLCipher, or MemSQL database.
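One possible shape for the storage module, using SQLite (one of the databases named above), stores one row per key frame with its serialized vectors, time information, and location information. The table and column names are illustrative, not from the patent.

```python
# Illustrative SQLite schema: one row per key frame holding its general
# vector, specific vectors, time information, and location information.
import json
import sqlite3

conn = sqlite3.connect("security.db")
conn.execute("""
CREATE TABLE IF NOT EXISTS keyframe_info (
    keyframe_id   INTEGER PRIMARY KEY,
    general_vec   TEXT NOT NULL,   -- JSON-serialized general vector information
    specific_vecs TEXT NOT NULL,   -- JSON list of specific vector information
    time_info     TEXT NOT NULL,   -- e.g. ISO-8601 timestamp from time-series data
    location_info TEXT NOT NULL    -- e.g. camera location label
)""")

def store_keyframe(kf_id, general_vec, specific_vecs, time_info, location_info):
    conn.execute(
        "INSERT OR REPLACE INTO keyframe_info VALUES (?, ?, ?, ?, ?)",
        (kf_id,
         json.dumps([float(v) for v in general_vec]),
         json.dumps([[float(v) for v in vec] for vec in specific_vecs]),
         time_info, location_info))
    conn.commit()
```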
In some embodiments, the analysis module 13 may analyze the video data based on a plurality of model prompt words of different categories to obtain the analysis information. In this case, the vision-language model analyzes the same scene multiple times from different angles, that is, with model prompt words constructed for different categories, so as to obtain a comprehensive picture description; this decouples the analysis from any single application scenario and enables full-content general queries across different scenarios.
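A short sketch of this multi-angle analysis follows, reusing the hypothetical `vlm_describe` interface from the earlier sketch; the prompt categories and texts are assumptions.

```python
# Analyzing the same key frame several times with category-specific model
# prompt words to build a comprehensive picture description.
PROMPTS = {
    "scene":    "Describe the overall scene and environment.",
    "people":   "Describe every person: clothing, pose, activity.",
    "vehicles": "Describe any vehicles: type, color, direction.",
    "objects":  "List notable objects and where they are placed.",
}

def multi_prompt_analysis(frame, vlm_describe):
    return {category: vlm_describe(frame, prompt=p)
            for category, p in PROMPTS.items()}
```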
In some embodiments, the interactive platform 20 may be configured to convert the query information of the user into a query instruction using the large model, and obtain the query content, the target time, and the target location of the user based on the query instruction.
In some embodiments, the query information of the user may be text information. The interaction platform 20 may utilize a large language model (LLM) to convert the text information into a query instruction. The query instruction may have a preset format; that is, the interaction platform 20 may convert the text information into a formatted query instruction using the LLM, obtaining the query instruction by extracting and formatting the text information.
In some embodiments, the user's query information may be voice information. The interaction platform 20 may also be configured to convert the voice information into text information using a large model and obtain the query instruction based on the text information. In some embodiments, the user's speech may be converted into text by an automatic speech recognition (ASR) model; selecting a high-precision ASR model improves the accuracy of speech-to-text conversion.
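The conversion of (possibly ASR-transcribed) text into a formatted query instruction might look like the sketch below. `llm_complete` stands for any large-language-model call, and the JSON field names are illustrative; the patent specifies a preset format but not its fields.

```python
# Converting free-form query information into a formatted query instruction
# carrying the query content, target time, and target place.
import json

SYSTEM_PROMPT = (
    "Extract the user's request as JSON with exactly these keys: "
    '"query_content", "target_time", "target_place". '
    "Use null for anything the user did not specify."
)

def to_query_instruction(user_text: str, llm_complete) -> dict:
    raw = llm_complete(system=SYSTEM_PROMPT, user=user_text)
    return json.loads(raw)
    # e.g. {"query_content": "woman in a white coat",
    #       "target_time": "12:00-14:00",
    #       "target_place": "company front desk"}
```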
In some embodiments, the interaction platform 20 may obtain analysis information matching the target time and the target location from the analysis platform 10, perform similarity calculation on the query content and the analysis information obtained by matching, and output the query result. The similarity calculation may refer to a semantic similarity calculation.
In some embodiments, the interaction platform 20 may be configured to vectorize the query content, obtain a plurality of key frames based on the target time and the target location, use the large model to perform semantic similarity calculation between the vectorized query content and the general vector information and specific vector information corresponding to those key frames, and screen out at least one key frame as the query result.
In some embodiments, the interaction platform 20 may perform embedding vectorization on the query content to obtain vectorized query content. The interaction platform 20 may obtain the corresponding data information, and the key frames corresponding to that data information, based on the user's query information. Specifically, the interaction platform 20 may acquire the data information and key frames corresponding to the user's target location and target time. For example, if the user's target location is "company front desk" and the target time is "12:00 to 14:00", the interaction platform 20 may obtain the key frames whose time information is 12:00, 13:00, or 14:00 and whose location information is the company front desk. It should be noted that the above time information and location information are only examples.
In some embodiments, the interaction platform 20 may perform semantic similarity calculations between each key frame's data information and the vectorized query content. In some embodiments, the interaction platform 20 may rank the calculation results in descending order of relevance and select at least one key frame as the query result based on relevance.
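A sketch of this retrieval step is given below: cosine similarity between the vectorized query content and each key frame's general and specific vectors, ranked in descending order of relevance. Scoring a key frame by its best-matching vector is an assumption.

```python
# Rank key frames by semantic similarity to the vectorized query content.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def rank_keyframes(query_vec, keyframes, top_k=5):
    """keyframes: list of dicts with 'id', 'general_vec', 'specific_vecs'."""
    scored = []
    for kf in keyframes:
        score = max([cosine(query_vec, kf["general_vec"])] +
                    [cosine(query_vec, v) for v in kf["specific_vecs"]])
        scored.append((score, kf["id"]))
    scored.sort(reverse=True)            # most relevant first
    return scored[:top_k]
```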
In some embodiments, the large model-based natural language interaction retrieval intelligent security system 1 may further include a camera module 30 for providing video data. The interaction platform 20 may be further configured to screen camera operation commands out of the query instruction and, based on a camera operation command, either acquire video data from the camera module 30 as the query result or send the video data to the analysis platform 10.
In some embodiments, camera module 30 may include a plurality of cameras for acquiring video data. The location information in the video data may be a position where the camera is located or a position where the camera is photographed.
In some embodiments, the interaction platform 20 may categorize query instructions. Specifically, the interaction platform 20 may divide query instructions into camera-related instructions and irrelevant instructions, where a camera-related instruction refers to instruction information that requires acquiring video data through the camera module 30.
In some embodiments, the interaction platform 20 may convert camera-related instructions into camera operation commands, such as playback or retrieval of video data, based on a large language model. The large language model may be given dedicated system prompt words; that is, the interaction platform 20 may extract camera-related instructions at the semantic level to produce camera operation commands.
In some embodiments, the interaction platform 20 may acquire video data from the camera module 30 in real time via the RTSP protocol, based on the camera operation commands. RTSP (Real-Time Streaming Protocol) is a network protocol that can be used to control interactions between a streaming server and a client.
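A minimal sketch of real-time acquisition over RTSP with OpenCV follows; the URL format, credentials, and stream path are placeholders, since cameras differ in their RTSP paths.

```python
# Pull live frames from a camera over RTSP and hand them to the pipeline.
import cv2

def stream_frames(rtsp_url="rtsp://user:pass@192.168.1.10:554/stream1"):
    cap = cv2.VideoCapture(rtsp_url)
    if not cap.isOpened():
        raise ConnectionError(f"cannot open RTSP stream: {rtsp_url}")
    try:
        while True:
            ok, frame = cap.read()
            if not ok:          # stream dropped or ended
                break
            yield frame         # e.g. feed to the analysis platform
    finally:
        cap.release()
```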
In some embodiments, the interaction platform 20 may be built on robotic process automation (RPA). RPA is a technology that uses software robots to simulate and integrate human interactions with digital systems. RPA can be used to automate repetitive, rule-based tasks; for example, the interaction platform 20 may classify query instructions and obtain video data from the camera module 30 on the basis of RPA.
In some embodiments, the interaction platform 20 may include an AI (artificial intelligence) agent, which may integrate a plurality of large language models. Specifically, the AI agent may include a first large language model, a second large language model, and a third large language model. As shown in fig. 3, the first large language model may be used to divide query instructions into camera-related instructions and irrelevant instructions, where a camera-related instruction may include a camera operation instruction; the second large language model may convert camera-related instructions into camera operation commands, which may be obtained through formatting; and the third large language model may be used to generate text results.
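The three-model agent described above might be wired together as in the sketch below. Each `llm_*` callable stands for a separately prompted large language model, and `fetch_from_camera` is a hypothetical helper; the routing labels are assumptions.

```python
# Three-LLM agent: classify the query instruction, format camera commands,
# and generate the text result.
def handle_query(query_instruction, llm_classify, llm_to_command, llm_to_text,
                 fetch_from_camera):
    label = llm_classify(query_instruction)          # 1st LLM: related vs. irrelevant
    if label == "camera_related":
        command = llm_to_command(query_instruction)  # 2nd LLM: formatted camera command
        video = fetch_from_camera(command)           # e.g. RTSP playback/retrieval
        return video, llm_to_text(query_instruction) # 3rd LLM: accompanying text result
    return None, llm_to_text(query_instruction)
```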
In some embodiments, the query results may include at least one of image results, text results, and voice results.
In some embodiments, the image result may be at least one key frame. In some embodiments, the text result may be reply text generated based on the key frame's data information and the user's query information. For example, the reply text may be: "The person in the image is a person in a white coat who appeared at the company front desk at 2 p.m. on July 24, 2024."
In some embodiments, the text result may also be reply text generated based on the query instruction and the camera operation command. For example, the reply text may be: "Playing back the surveillance video of the company front desk at 2 p.m. on July 24, 2024."
In some embodiments, the interaction platform 20 may convert text results into speech results via a TTS (text-to-speech) model. In this case, the voice-feedback capability is enhanced, improving the user's interaction experience and the usability of the system.
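For illustration, a text result could be voiced with an offline TTS engine as below; pyttsx3 is a stand-in choice, as the patent only specifies a TTS model in general.

```python
# Convert a text result into a voice result with an offline TTS engine.
import pyttsx3

def speak(text_result: str) -> None:
    engine = pyttsx3.init()
    engine.say(text_result)
    engine.runAndWait()
```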
In the present application, the analysis platform 10 is constructed to analyze the video data and acquire the general information and the specific information describing the image frames of the video data, so that the speed and accuracy of monitoring-data retrieval can be remarkably improved and the requirements of large-scale video data processing met; meanwhile, by constructing the interaction platform 20, the application converts the user's query information into a query instruction and outputs a query result through similarity calculation against the analysis information, improving the system's operating efficiency and user experience; in addition, the interaction platform 20 is built automatically on the basis of robotic process automation, which greatly simplifies the user's operation flow, improves operational convenience, and gives the system greater user-friendliness and intelligence.
The application also comprises a security interaction method, which can comprise the following steps:
Analyzing and processing the video data to obtain analysis information corresponding to the video data; wherein the video data may include a plurality of image frames, and the analysis information may include general information and specific information describing the plurality of image frames;
Converting the query information of the user into a query instruction by utilizing the large model, and obtaining query content, target time and target place of the user based on the query instruction;
And obtaining analysis information matched with the target time and the target place, performing similarity calculation on the query content and the analysis information obtained by matching, and outputting a query result.
Those of ordinary skill in the art will appreciate that all or part of the features/steps of the method embodiments described above may be implemented as a method, a data processing system, or a computer program product, and may be implemented in hardware, in software, or in a combination of hardware and software. The foregoing computer program may be stored in one or more computer-readable storage media; when the computer program is executed (e.g., by a processor), the steps of the foregoing security interaction method embodiments are performed.
The aforementioned storage media capable of storing program code include: a static hard disk, a solid-state disk, static random-access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), optical storage, magnetic storage, flash memory, magnetic or optical disks, and/or combinations thereof; they may be implemented by any type of volatile or non-volatile storage device or any combination thereof.
As shown in fig. 4, the present application also provides an embodiment of a processing device 4, including one or more processors 41 and a memory 40; the memory 40 is configured to store one or more computer programs, and the one or more processors 41 are configured to execute the one or more computer programs stored in the memory 40, so that the processor 41 performs the features/steps of the embodiment of the security interaction method described above.
The foregoing is only illustrative of the preferred embodiments of the application, and it will be appreciated by those skilled in the art that various changes in the features and embodiments may be made and equivalents may be substituted without departing from the spirit and scope of the application. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the application without departing from the essential scope thereof. Therefore, it is intended that the application not be limited to the particular embodiment disclosed, but that the application will include all embodiments falling within the scope of the appended claims.

Claims (10)

1. The natural language interaction retrieval intelligent security system based on the large model is characterized by comprising an analysis platform and an interaction platform;
The analysis platform analyzes and processes the video data to obtain analysis information corresponding to the video data; wherein the video data includes a plurality of image frames, and the analysis information includes general information and specific information describing the plurality of image frames;
the interaction platform is built automatically based on robotic process automation; the interaction platform is configured to convert query information of a user into a query instruction by utilizing the large model, and obtain query content, target time and target place of the user based on the query instruction;
And the interaction platform acquires analysis information matched with the target time and the target place from the analysis platform, performs similarity calculation on the query content and the analysis information acquired by matching, and outputs a query result.
2. The large model-based natural language interactive retrieval intelligent security system of claim 1, wherein the analysis platform comprises an acquisition module, a detection module and an analysis module;
the acquisition module is configured to acquire video data containing a plurality of objects;
the detection module is configured to detect the video data and acquire a key frame corresponding to each object;
the analysis module is configured to analyze each key frame and acquire the general information and the specific information corresponding to the key frame.
3. The large model based natural language interactive retrieval intelligent security system of claim 2, wherein the detection module is configured to obtain objects in each image frame within the video data and generate a confidence probability and bounding box for each object.
4. The large model-based natural language interaction retrieval intelligent security system according to claim 3, wherein the detection module is configured to obtain, based on the motion state of each object and in combination with time-series data, a keyframe corresponding to each object, wherein the keyframe is the image frame with the highest confidence probability.
5. The large model based natural language interactive retrieval intelligent security system of claim 3, wherein the analysis module is configured to perform overall picture analysis on each of the key frames to obtain a generic description as the generic information; the analysis module is configured to perform independent analysis on each bounding box and acquire object descriptions as the specific information.
6. The large model-based natural language interactive retrieval intelligent security system of claim 2, wherein the analysis module is configured to vectorize the general information and the specific information respectively to obtain general vector information and specific vector information.
7. The large model-based natural language interactive retrieval intelligent security system according to claim 2, wherein the analysis module analyzes the video data based on a plurality of model prompt words of different categories to obtain the analysis information.
8. The large model-based natural language interactive retrieval intelligent security system according to claim 6, wherein the interaction platform is configured to vectorize the query content, use the large model to perform semantic similarity calculation between the vectorized query content and the general vector information and the specific vector information corresponding to the key frames, and screen out at least one key frame as the query result.
9. The large model-based natural language interactive retrieval intelligent security system according to claim 1, wherein the query information of the user is voice information; the interactive platform is further configured to convert the voice information into text information using the large model and obtain the query instruction based on the text information.
10. The large model based natural language interactive retrieval intelligent security system of claim 1, further comprising a camera module for providing the video data; the interaction platform is configured to screen a camera operation command from the query instruction, and acquire the video data from the camera module based on the camera operation command as the query result or send the video data to the analysis platform.
Application CN202411421116.2A, filed 2024-10-12: Natural language interaction retrieval intelligent security system based on large model (status: Pending)

Publications (1)

CN118939831A, published 2024-11-12


Legal Events

PB01: Publication