CN119577186A - Data retrieval method, device, computer equipment and medium based on artificial intelligence

Data retrieval method, device, computer equipment and medium based on artificial intelligence

Info

Publication number
CN119577186A
Authority
CN
China
Prior art keywords
text
prediction
segment
representation
fragment
Prior art date
Legal status
Pending
Application number
CN202411630356.3A
Other languages
Chinese (zh)
Inventor
唐小初
舒畅
陈远旭
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202411630356.3A
Publication of CN119577186A

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73 Querying
    • G06F16/732 Query formulation
    • G06F16/7328 Query by example, e.g. a complete video frame or video sequence
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract


The present application belongs to the field of artificial intelligence technology and relates to an artificial-intelligence-based data retrieval method, apparatus, computer device and storage medium. The method includes: acquiring a search text corresponding to a target video; performing text preprocessing on the search text to obtain a text embedded representation; performing image feature extraction on the target video to obtain image features; calling the cross-attention layer of a stacked pointer network to perform cross-attention processing on the text embedded representation and the image features, obtaining feature representations of multiple frames; performing prediction processing on these feature representations to obtain segment start positions, and determining segment start intervals based on the segment start positions; performing offset prediction processing on the segment start intervals based on the text embedded representation to obtain prediction results; and generating a target segment position corresponding to the search text based on the prediction results. In addition, the target segment position can be stored in a blockchain. Through the use of the stacked pointer network, the present application improves both the processing efficiency and the accuracy of video data retrieval.

Description

Data retrieval method, device, computer equipment and medium based on artificial intelligence
Technical Field
The application relates to the technical fields of artificial intelligence, financial technology and medical health, and in particular to a data retrieval method, apparatus, computer device and storage medium based on artificial intelligence.
Background
In an era when information technology advances with each passing day, video has become an important carrier of multimedia information whose applications permeate every aspect of social life. In finance and medical health in particular, video data plays an irreplaceable role in investigation and evidence collection, education and training, customer service, public-health communication, academic exchange and research dissemination, business marketing and brand building, and more. To manage and exploit these massive video resources effectively, text-to-video moment retrieval techniques have emerged. Text-to-video moment retrieval is an advanced retrieval method that combines the strengths of text retrieval and video retrieval: by entering keywords, phrases or questions, users can quickly and accurately locate video clips, or specific moments, containing the relevant content within a vast video library. The technique greatly improves the efficiency of retrieving video information, enabling users to obtain the key information they need in a short time, which is significant for improving work efficiency and decision accuracy in the financial field.
However, despite its many advantages, text-to-video moment retrieval still faces numerous challenges in practice. Most mainstream text-video moment retrieval methods are currently based on sliding-window or anchor-based strategies. These methods perform well on simple, single-segment retrieval tasks, but often suffer a significant performance penalty on complex moment retrieval tasks such as nested segments. Specifically, the sliding-window method slides a fixed-length window over the video stream and matches the text against the video content inside each window; it cannot accurately capture dynamic changes in the video and has limited ability to identify nested segments. The anchor-based method relies on preset anchor points to locate key segments in the video; although it improves retrieval accuracy to some extent, its adaptability to complex and changeable video content remains insufficient, and its results are prone to misses and misjudgments. Consequently, when existing text-video moment retrieval techniques handle complex moment retrieval tasks, retrieval efficiency is generally low and accuracy is hard to guarantee.
Disclosure of Invention
The embodiments of the present application aim to provide a data retrieval method, apparatus, computer device and storage medium based on artificial intelligence, so as to solve the technical problems that existing text-video moment retrieval techniques suffer from low retrieval efficiency and hard-to-guarantee accuracy when handling complex moment retrieval tasks.
In order to solve the above technical problems, an embodiment of the present application provides an artificial-intelligence-based data retrieval method that adopts the following technical scheme:
acquiring a search text, input by a user, corresponding to a target video;
performing text preprocessing on the search text to obtain a corresponding text embedded representation;
performing image feature extraction on the target video to obtain corresponding image features;
calling a preset stacked pointer network, and performing cross-attention processing on the text embedded representation and the image features based on a cross-attention layer in the stacked pointer network to obtain feature representations of a plurality of frames;
performing prediction processing on the feature representations of the plurality of frames to obtain corresponding segment start positions, and determining corresponding segment start intervals based on the segment start positions;
performing offset prediction processing on the segment start intervals based on the text embedded representation to obtain corresponding prediction results;
and generating a target segment position corresponding to the search text based on the prediction results.
Further, the step of performing text preprocessing on the search text to obtain a corresponding text embedded representation specifically includes:
performing word segmentation on the search text based on a preset tokenizer to obtain a corresponding word segmentation result;
calling a preset word embedding model;
performing conversion processing on the word segmentation result based on the word embedding model to obtain a corresponding embedded representation;
and taking the embedded representation as the text embedded representation.
Further, the step of performing image feature extraction on the target video to obtain corresponding image features specifically includes:
performing frame sampling on the target video to obtain corresponding image frames;
calling a preset image encoder;
performing feature extraction on the image frames based on the image encoder to obtain corresponding feature vectors;
and taking the feature vectors as the image features.
Further, the step of performing prediction processing on the feature representations of the plurality of frames to obtain corresponding segment start positions and determining corresponding segment start intervals based on the segment start positions specifically includes:
performing start-position prediction processing on the feature representation of each frame based on the linear layer to obtain a probability value for each frame;
acquiring a preset probability threshold;
selecting, from the frames, designated frames whose probability values are greater than the probability threshold, and taking the designated frames as the segment start positions;
and determining corresponding segment start intervals based on the segment start positions.
Further, the step of performing offset prediction processing on the segment start intervals based on the text embedded representation to obtain corresponding prediction results specifically includes:
extracting first feature representations of all target frames contained in the segment start interval;
performing cross-attention processing on the first feature representations and the text embedded representation to obtain corresponding second feature representations;
performing aggregation processing on the second feature representations based on a preset pooling strategy to obtain a corresponding representation vector;
mapping the representation vector onto an offset prediction task based on the fully connected layer to obtain a corresponding prediction confidence and prediction offset value;
and taking the prediction confidence and the prediction offset value as the prediction result corresponding to the segment start interval.
Further, the step of generating the target segment position corresponding to the search text based on the prediction results specifically includes:
screening out, from all segment start intervals, first segment start intervals whose prediction confidence is greater than a preset confidence threshold;
acquiring the specified prediction offset value corresponding to the first segment start interval;
performing corresponding position fine-tuning on the first segment start interval based on the specified prediction offset value to obtain a fine-tuned second segment start interval;
performing conversion processing on the second segment start interval based on a preset target conversion form to obtain a processed third segment start interval;
and generating the target segment position corresponding to the search text based on the third segment start interval.
Further, after the step of generating the target segment position corresponding to the search text based on the prediction results, the method further includes:
generating a corresponding search result based on the target segment position;
acquiring the data push mode corresponding to the user;
and pushing the search result to the user based on the data push mode.
In order to solve the above technical problems, an embodiment of the present application further provides an artificial-intelligence-based data retrieval apparatus that adopts the following technical scheme:
a first acquisition module, configured to acquire a search text, input by a user, corresponding to a target video;
a preprocessing module, configured to perform text preprocessing on the search text to obtain a corresponding text embedded representation;
an extraction module, configured to perform image feature extraction on the target video to obtain corresponding image features;
a second processing module, configured to call a preset stacked pointer network and perform cross-attention processing on the text embedded representation and the image features based on a cross-attention layer in the stacked pointer network to obtain feature representations of a plurality of frames;
a first prediction module, configured to perform prediction processing on the feature representations of the plurality of frames to obtain corresponding segment start positions, and determine corresponding segment start intervals based on the segment start positions;
a second prediction module, configured to perform offset prediction processing on the segment start intervals based on the text embedded representation to obtain corresponding prediction results;
and a first generation module, configured to generate a target segment position corresponding to the search text based on the prediction results.
In order to solve the above technical problems, an embodiment of the present application further provides a computer device that adopts the following technical scheme:
acquiring a search text, input by a user, corresponding to a target video;
performing text preprocessing on the search text to obtain a corresponding text embedded representation;
performing image feature extraction on the target video to obtain corresponding image features;
calling a preset stacked pointer network, and performing cross-attention processing on the text embedded representation and the image features based on a cross-attention layer in the stacked pointer network to obtain feature representations of a plurality of frames;
performing prediction processing on the feature representations of the plurality of frames to obtain corresponding segment start positions, and determining corresponding segment start intervals based on the segment start positions;
performing offset prediction processing on the segment start intervals based on the text embedded representation to obtain corresponding prediction results;
and generating a target segment position corresponding to the search text based on the prediction results.
In order to solve the above technical problems, an embodiment of the present application further provides a computer-readable storage medium that adopts the following technical scheme:
acquiring a search text, input by a user, corresponding to a target video;
performing text preprocessing on the search text to obtain a corresponding text embedded representation;
performing image feature extraction on the target video to obtain corresponding image features;
calling a preset stacked pointer network, and performing cross-attention processing on the text embedded representation and the image features based on a cross-attention layer in the stacked pointer network to obtain feature representations of a plurality of frames;
performing prediction processing on the feature representations of the plurality of frames to obtain corresponding segment start positions, and determining corresponding segment start intervals based on the segment start positions;
performing offset prediction processing on the segment start intervals based on the text embedded representation to obtain corresponding prediction results;
and generating a target segment position corresponding to the search text based on the prediction results.
Compared with the prior art, the embodiments of the present application mainly have the following beneficial effects:
The method first acquires a search text, input by a user, corresponding to a target video; performs text preprocessing on the search text to obtain a corresponding text embedded representation; performs image feature extraction on the target video to obtain corresponding image features; then calls a preset stacked pointer network and performs cross-attention processing on the text embedded representation and the image features based on the cross-attention layer in the stacked pointer network to obtain feature representations of a plurality of frames; performs prediction processing on these feature representations to obtain corresponding segment start positions and determines corresponding segment start intervals from them; performs offset prediction processing on the segment start intervals based on the text embedded representation to obtain corresponding prediction results; and finally generates the target segment position corresponding to the search text based on the prediction results. In other words, after obtaining the user's search text for the target video, the application derives a text embedded representation from the text and image features from the video, uses the stacked pointer network to cross-attend between the two and obtain per-frame feature representations, and then processes those representations in two stages: it first predicts initial segment start positions and then precisely predicts their offsets. Based on the resulting predictions, the target segment position corresponding to the input search text can be generated quickly and accurately, which improves the processing efficiency of video data retrieval and guarantees the accuracy of the retrieved target segment position.
Drawings
In order to illustrate the solution of the present application more clearly, the drawings required for describing the embodiments are briefly introduced below. The drawings described below show merely some embodiments of the present application; a person of ordinary skill in the art may derive other drawings from them without inventive effort.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow chart of one embodiment of an artificial intelligence based data retrieval method according to the present application;
FIG. 3 is a schematic diagram of one embodiment of an artificial intelligence based data retrieval device in accordance with the present application;
FIG. 4 is a schematic structural diagram of one embodiment of a computer device in accordance with the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terms used in this description are for describing particular embodiments only and are not intended to limit the application. The terms "comprising" and "having", and any variations thereof, in the description, claims and drawings of the application are intended to cover non-exclusive inclusion. The terms "first", "second" and the like in the description, claims or drawings are used to distinguish different objects and do not describe a particular order.
Reference herein to "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor to a separate or alternative embodiment mutually exclusive of other embodiments. Those skilled in the art will understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
In order to make the person skilled in the art better understand the solution of the present application, the technical solution of the embodiment of the present application will be clearly and completely described below with reference to the accompanying drawings.
As shown in fig. 1, the system architecture 100 may include a terminal device 101, a network 102, and a server 103, where the terminal device 101 may be a notebook 1011, a tablet 1012, or a cell phone 1013. Network 102 is the medium used to provide communication links between terminal device 101 and server 103. Network 102 may include various connection types such as wired, wireless communication links, or fiber optic cables, among others.
A user may interact with the server 103 via the network 102 using the terminal device 101 to receive or send messages or the like. Various communication client applications, such as a web browser application, a shopping class application, a search class application, an instant messaging tool, a mailbox client, social platform software, and the like, may be installed on the terminal device 101.
The terminal device 101 may be any of various electronic devices that have a display screen and support web browsing. In addition to the notebook 1011, the tablet 1012 or the mobile phone 1013, the terminal device 101 may be an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop portable computer, a desktop computer, and the like.
The server 103 may be a server providing various services, such as a background server providing support for pages displayed on the terminal device 101.
It should be noted that, the data retrieval method based on artificial intelligence provided by the embodiment of the application is generally executed by a server/terminal device, and correspondingly, the data retrieval device based on artificial intelligence is generally arranged in the server/terminal device.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow chart of one embodiment of an artificial-intelligence-based data retrieval method according to the present application is shown. The order of the steps in the flowchart may be changed, and some steps may be omitted, according to different needs. The artificial-intelligence-based data retrieval method provided by the embodiments of the application can be applied to any scenario requiring data retrieval, and to products in such scenarios, for example data retrieval in the financial insurance field. The method comprises the following steps:
Step S201, obtaining a search text corresponding to a target video input by a user.
In this embodiment, the electronic device (e.g., the server/terminal device shown in FIG. 1) on which the artificial-intelligence-based data retrieval method runs may acquire the search text through a wired or a wireless connection. It should be noted that the wireless connection may include, but is not limited to, 3G/4G/5G, Wi-Fi, Bluetooth, WiMAX, ZigBee and UWB (ultra wideband) connections, as well as other wireless connections now known or developed in the future. The execution subject of the present application may be a data retrieval system, or simply a system. There is a semantic match between the description in the search text entered by the user and the image content in the target video. For example, if the search text is "introduce car insurance purchase notes", then some segment of the video should contain footage in which an insurance agent explains the points to note when purchasing car insurance.
And step S202, performing text preprocessing on the search text to obtain a corresponding text embedded representation.
In this embodiment, the text preprocessing is performed on the search text to obtain a specific implementation process of the corresponding text embedded representation, which will be described in further detail in the following specific embodiments, which will not be described herein.
And step S203, extracting image features of the target video to obtain corresponding image features.
In this embodiment, the specific implementation process of extracting the image features of the target video to obtain the corresponding image features is described in detail in the following specific embodiments, which will not be described herein.
Step S204, calling a preset stacked pointer network, and performing cross-attention processing on the text embedded representation and the image features based on a cross-attention layer in the stacked pointer network to obtain feature representations of a plurality of frames.
In this embodiment, the stacked pointer network includes at least a cross-attention layer, a linear layer and a fully connected layer. The cross-attention layer may specifically be a Transformer model with multiple encoder layers. The image features are used as the input sequence of the Transformer model, with each feature vector corresponding to one sequence element. The text embedded representation is likewise converted into a format the Transformer can process (e.g., positional encodings are added and it is packed into a sequence). In each Transformer layer, a cross-attention mechanism is then applied in which Q (Query) comes from the image features while K (Key) and V (Value) come from the text embedded representation; this is typically realized with a multi-head attention mechanism. After multiple layers of cross-attention, interacted feature representations are obtained, which retain the dimension (total_frame, dim). During image feature extraction from the target video, the feature vectors of all frames can be combined into a feature matrix of dimension (total_frame, dim), where total_frame is the total number of frames and dim is the feature-vector dimension.
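As an illustration of this step, the following is a minimal PyTorch sketch of a stack of cross-attention layers in which the frame features act as the query and the text embeddings act as key and value. It is not the patent's actual implementation; the dimension, head count, layer count and all names are illustrative assumptions:

```python
import torch
import torch.nn as nn

class CrossAttentionStack(nn.Module):
    """Stack of cross-attention layers: frame features attend to text tokens."""
    def __init__(self, dim=512, num_heads=8, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_layers)
        )
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(num_layers))

    def forward(self, frame_feats, text_embeds):
        # frame_feats: (batch, total_frame, dim) -- Q
        # text_embeds: (batch, num_tokens, dim) -- K and V
        x = frame_feats
        for attn, norm in zip(self.layers, self.norms):
            out, _ = attn(query=x, key=text_embeds, value=text_embeds)
            x = norm(x + out)  # residual connection keeps the (total_frame, dim) shape
        return x               # per-frame representations fused with text information

# toy usage: 120 sampled frames, 12 text tokens, 512-d features
frames = torch.randn(1, 120, 512)
tokens = torch.randn(1, 12, 512)
fused = CrossAttentionStack()(frames, tokens)   # -> (1, 120, 512)
```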
Step S205, performing prediction processing on the feature representations of the frames to obtain corresponding segment start positions, and determining corresponding segment start intervals based on the segment start positions.
In this embodiment, the above-mentioned specific implementation process of predicting the feature representations of the multiple frames to obtain the corresponding segment start positions and determining the corresponding segment start intervals based on the segment start positions will be described in further detail in the following specific embodiments, which will not be described herein.
Step S206, performing offset prediction processing on the segment start intervals based on the text embedded representation to obtain corresponding prediction results.
In this embodiment, the specific implementation of performing offset prediction processing on the segment start intervals based on the text embedded representation to obtain corresponding prediction results will be described in further detail in the following embodiments and is not repeated here.
Step S207, generating a target segment position corresponding to the search text based on the prediction results.
In this embodiment, the specific implementation of generating the target segment position corresponding to the search text based on the prediction results will be described in further detail in the following embodiments and is not repeated here.
As described above, the method first obtains the user's search text for the target video, derives a text embedded representation through text preprocessing, and extracts image features from the video. The stacked pointer network then cross-attends between the text embedded representation and the image features to obtain feature representations of a plurality of frames, from which segment start positions and segment start intervals are predicted; offset prediction on these intervals, guided by the text embedded representation, yields the prediction results from which the target segment position is generated. This two-stage design, coarse start-position prediction followed by precise offset prediction, allows the target segment position corresponding to the input search text to be generated quickly and accurately, improving the processing efficiency of video data retrieval while guaranteeing the accuracy of the retrieved target segment position.
In some alternative implementations, step S202 includes the steps of:
Performing word segmentation on the search text based on a preset tokenizer to obtain a corresponding word segmentation result.
In this embodiment, the search text may be segmented by a tokenizer with a word segmentation function, splitting the search text into tokens, i.e., the word segmentation result.
Calling a preset word embedding model.
In this embodiment, the choice of word embedding model is not particularly limited; models such as Word2Vec and GloVe may be used.
Performing conversion processing on the word segmentation result based on the word embedding model to obtain a corresponding embedded representation.
In this embodiment, the word segmentation result (the tokens after segmentation) is converted into embedded representations by the word embedding model. These embedded representations are typically vectors of fixed dimension.
The embedded representation is taken as the text embedded representation.
The search text is thus segmented by the preset tokenizer to obtain a word segmentation result, a preset word embedding model is called, the word segmentation result is converted by the word embedding model into a corresponding embedded representation, and the embedded representation is taken as the text embedded representation. By tokenizing the search text and converting the result with a word embedding model, the text preprocessing of the search text can be completed efficiently and accurately, improving the accuracy and consistency of the generated text embedded representation.
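For illustration, a minimal sketch of this preprocessing step follows. It assumes a simple whitespace tokenizer and a randomly initialized embedding table as a stand-in for a pretrained Word2Vec or GloVe model; the toy vocabulary, dimension and names are assumptions, not the patent's implementation:

```python
import torch
import torch.nn as nn

# illustrative vocabulary; in practice the tokenizer and the pretrained
# Word2Vec/GloVe vectors would share a real vocabulary
vocab = {"<unk>": 0, "introduce": 1, "car": 2, "insurance": 3, "purchase": 4, "notes": 5}
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=512)

def text_to_embeddings(search_text: str) -> torch.Tensor:
    # word segmentation: split the search text into tokens
    tokens = search_text.lower().split()
    ids = torch.tensor([vocab.get(t, vocab["<unk>"]) for t in tokens])
    # convert each token into a fixed-dimension embedding vector
    return embedding(ids)              # (num_tokens, 512)

text_embeds = text_to_embeddings("introduce car insurance purchase notes")
print(text_embeds.shape)               # torch.Size([5, 512])
```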
In some alternative implementations of the present embodiment, step S203 includes the steps of:
Performing frame sampling on the target video to obtain corresponding image frames.
In this embodiment, a video processing library may be used to read the target video and sample frames at a specified sampling rate (e.g., two frames per second) to obtain the corresponding image frames.
Calling a preset image encoder.
In this embodiment, the choice of image encoder is not particularly limited; models such as CLIP and BLIP-2 may be used.
Performing feature extraction on the image frames based on the image encoder to obtain corresponding feature vectors.
In this embodiment, the image encoder performs feature extraction on the sampled image frames, converting each image frame into a feature vector of fixed dimension.
And taking the feature vectors as the image features.
The target video is thus frame-sampled to obtain corresponding image frames, a preset image encoder is called, features are extracted from the image frames by the image encoder to obtain corresponding feature vectors, and the feature vectors are taken as the image features. By sampling frames from the target video and then extracting features with an image encoder, the image feature extraction of the target video can be completed efficiently and accurately, improving the accuracy and consistency of the generated image features.
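For illustration, a minimal sketch of this extraction step follows, assuming OpenCV for frame sampling and the open-source CLIP package as one possible image encoder; the sampling rate, model choice and names are illustrative assumptions:

```python
import cv2
import torch
import clip            # https://github.com/openai/CLIP
from PIL import Image

model, preprocess = clip.load("ViT-B/32", device="cpu")

def extract_image_features(video_path: str, fps_sample: float = 2.0) -> torch.Tensor:
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS)
    step = max(int(round(native_fps / fps_sample)), 1)   # keep every step-th frame
    feats, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            # BGR (OpenCV) -> RGB -> CLIP preprocessing -> feature vector
            image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            with torch.no_grad():
                feats.append(model.encode_image(preprocess(image).unsqueeze(0)))
        idx += 1
    cap.release()
    return torch.cat(feats)            # (total_frame, dim) feature matrix

# features = extract_image_features("target_video.mp4")
```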
In some alternative implementations, the stacked pointer network further includes a linear layer, and step S205 includes the following steps:
Performing start-position prediction processing on the feature representation of each frame based on the linear layer to obtain a probability value for each frame.
In this embodiment, the feature representation of each frame output by the cross-attention layer is mapped onto a binary classification task by the linear layer, and an activation function (e.g., sigmoid) is applied to convert the output into a probability value indicating the probability that the frame is the start of a segment.
Acquiring a preset probability threshold.
In this embodiment, the value of the probability threshold is not specifically limited and may be set according to actual service requirements; for example, it may be set to 0.5.
Screening out, from the frames, designated frames whose probability values are greater than the probability threshold, and taking the designated frames as the segment start positions.
In this embodiment, the probability values of all frames are traversed, and frames whose probability exceeds the threshold (e.g., 0.5) are marked as possible segment start positions. Because each frame feature is checked for being the start position of a segment to be retrieved, it is ensured that all segments are predicted.
And determining corresponding segment start intervals based on the segment start positions.
In this embodiment, all possible segment start intervals are determined from the obtained start positions. These segment start intervals may overlap or be non-adjacent to one another.
The feature representation of each frame is thus subjected to start-position prediction by the linear layer to obtain per-frame probability values; a preset probability threshold is acquired; the designated frames whose probability values exceed the threshold are screened out and taken as the segment start positions; and the corresponding segment start intervals are determined from those positions. Using the linear layer in the stacked pointer network to predict start positions from per-frame feature representations, then selecting frames above the preset probability threshold and deriving the segment start intervals from them, allows the prediction over the feature representations of a plurality of frames to be completed quickly and accurately, so that the target segment position corresponding to the search text can later be determined precisely from the obtained segment start intervals.
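For illustration, a minimal sketch of this first prediction stage follows. The fixed candidate interval length max_len is an illustrative assumption, since the patent leaves open exactly how the interval is derived from the start position:

```python
import torch
import torch.nn as nn

start_head = nn.Linear(512, 1)         # linear layer mapping each frame to a start logit

def predict_start_intervals(fused, prob_threshold=0.5, max_len=40):
    # fused: (total_frame, dim) frame representations from the cross-attention layer
    probs = torch.sigmoid(start_head(fused)).squeeze(-1)   # (total_frame,) start probabilities
    start_positions = (probs > prob_threshold).nonzero(as_tuple=True)[0]
    total = fused.size(0)
    # one candidate interval per start position; intervals may overlap
    intervals = [(int(s), min(int(s) + max_len, total)) for s in start_positions]
    return start_positions, intervals

# fused = torch.randn(120, 512)        # stand-in for real cross-attended features
# starts, intervals = predict_start_intervals(fused)
```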
In some alternative implementations, the stacked pointer network further includes a fully connected layer, and step S206 includes the following steps:
Extracting first feature representations of all target frames contained in the segment start interval.
In this embodiment, because the predicted segment start interval is only preliminary, a more accurate offset must still be predicted for the predicted start position. For each preliminarily determined segment start interval, the first feature representations of all frames within that interval are extracted.
Performing cross-attention processing on the first feature representations and the text embedded representation to obtain corresponding second feature representations.
In this embodiment, cross-attention is performed again between these feature representations and the text embedded representation to further fuse the text and image information, yielding the cross-attended feature representations, i.e., the second feature representations. This cross-attention may be implemented by adding one more encoder layer to the cross-attention layer or by using a separate, independent cross-attention layer.
Performing aggregation processing on the second feature representations based on a preset pooling strategy to obtain a corresponding representation vector.
In this embodiment, the cross-attended second feature representations may be aggregated by a pooling method (e.g., average pooling) into a representation vector of fixed dimension.
Mapping the representation vector onto an offset prediction task based on the fully connected layer to obtain a corresponding prediction confidence and prediction offset value.
In this embodiment, the representation vector is mapped onto the offset prediction task by the fully connected layer. This typically includes one output for the prediction confidence and one output for the predicted offset. The confidence output may be normalized with a sigmoid activation function, and the offset output may likewise be produced as a value between 0 and 1 (via an appropriate activation function or normalization).
And taking the prediction confidence and the prediction offset value as the prediction result corresponding to the segment start interval.
First feature representations of all target frames contained in each segment start interval are thus extracted; cross-attention between the first feature representations and the text embedded representation yields the second feature representations; the second feature representations are aggregated by the preset pooling strategy into a representation vector; the representation vector is mapped onto the offset prediction task by the fully connected layer to obtain the prediction confidence and prediction offset value; and these are taken as the prediction result for the segment start interval. Extracting the in-interval frame features, cross-attending them with the text embedded representation, pooling the result into a representation vector, and mapping that vector onto the offset prediction task with the fully connected layer of the stacked pointer network completes the offset prediction over the segment start intervals quickly and accurately, improving the accuracy of the generated prediction results so that the target segment position corresponding to the search text can be determined precisely from them.
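For illustration, a minimal sketch of this second, refining stage follows, implementing the refinement cross-attention as a single independent layer (one of the two options mentioned above); all sizes and names are illustrative assumptions:

```python
import torch
import torch.nn as nn

refine_attn = nn.MultiheadAttention(512, 8, batch_first=True)  # independent cross-attention layer
offset_head = nn.Linear(512, 2)        # fully connected layer: [confidence, offset]

def predict_offset(fused, text_embeds, interval):
    start, end = interval
    first_repr = fused[start:end].unsqueeze(0)          # features of frames inside the interval
    # fuse text and image information once more
    second_repr, _ = refine_attn(first_repr, text_embeds, text_embeds)
    pooled = second_repr.mean(dim=1)                    # average pooling -> (1, 512)
    out = torch.sigmoid(offset_head(pooled)).squeeze(0) # both outputs normalized into (0, 1)
    confidence, offset = out[0].item(), out[1].item()
    return confidence, offset

# fused = torch.randn(120, 512); text_embeds = torch.randn(1, 12, 512)
# conf, off = predict_offset(fused, text_embeds, (30, 70))
```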
In some alternative implementations of the present embodiment, step S207 includes the steps of:
Screening out, from all segment start intervals, first segment start intervals whose prediction confidence is greater than a preset confidence threshold.
In this embodiment, the confidence threshold is not specifically limited and may be set according to actual service requirements. Specifically, each segment start interval can be screened according to its prediction confidence: intervals with high confidence (prediction confidence greater than the confidence threshold) are more likely to be retained as final segment positions, while intervals with low confidence (prediction confidence less than the confidence threshold) may be discarded or processed further.
Acquiring the specified prediction offset value corresponding to the first segment start interval.
In this embodiment, the specified prediction offset value is the prediction offset value that has a matching relationship with the first segment start interval.
Performing corresponding position fine-tuning on the first segment start interval based on the specified prediction offset value to obtain a fine-tuned second segment start interval.
In this embodiment, the start and end positions of the screened first segment start interval may be fine-tuned using the obtained specified prediction offset value. The prediction offset value represents the offset of the interval's start position relative to the actual start position, so the adjusted start position is obtained by adding the prediction offset value to the frame number of the first segment start interval. The end position is obtained by adding a fixed interval length to the start position (or is determined dynamically from other information), yielding the fine-tuned second segment start interval.
Performing conversion processing on the second segment start interval based on a preset target conversion form to obtain a processed third segment start interval.
In this embodiment, the start and end positions of the fine-tuned second segment start interval may be expressed as frame numbers or timestamps and used as the final segment position, i.e., the processed third segment start interval. These positions can be used in subsequent video processing or analysis tasks.
Generating the target segment position corresponding to the search text based on the third segment start interval.
In this embodiment, the obtained third segment start interval may be used directly as the target segment position corresponding to the search text.
First segment start intervals whose prediction confidence exceeds the preset confidence threshold are thus screened out of all segment start intervals; the specified prediction offset value corresponding to each first segment start interval is acquired; position fine-tuning based on that offset value yields the fine-tuned second segment start interval; conversion based on the preset target conversion form yields the processed third segment start interval; and the target segment position corresponding to the search text is finally generated from the third segment start interval. By screening intervals on confidence, fine-tuning their positions with the matching prediction offset values, and converting the result into the target form, the final target segment position corresponding to the search text can be generated quickly and accurately, guaranteeing the accuracy of the obtained target segment position and achieving efficient, precise video content retrieval.
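For illustration, a minimal sketch of this refinement step follows, assuming frame-number intervals, a fixed interval length and timestamps (seconds) as the target conversion form; all of these are illustrative assumptions:

```python
def refine_intervals(intervals, predictions, fps_sample=2.0,
                     conf_threshold=0.6, fixed_len=40):
    """intervals: list of (start_frame, end_frame); predictions: list of (confidence, offset)."""
    results = []
    for (start, _), (confidence, offset) in zip(intervals, predictions):
        if confidence <= conf_threshold:
            continue                          # discard low-confidence intervals
        # fine-tune: the offset in (0, 1) is scaled to frames here (an assumption)
        # before being added to the interval's start frame number
        adjusted_start = start + offset * fixed_len
        adjusted_end = adjusted_start + fixed_len   # fixed interval length
        # target conversion form: frame numbers -> timestamps at the sampling rate
        results.append((adjusted_start / fps_sample, adjusted_end / fps_sample))
    return results

# target_positions = refine_intervals([(30, 70)], [(0.9, 0.25)])
# -> [(20.0, 40.0)] under the assumed 2 fps sampling
```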
In some optional implementations of this embodiment, after step S207, the electronic device may further perform the following steps:
Generating a corresponding search result based on the target segment position.
In this embodiment, the target segment position may be fed into a preset search result template to generate the corresponding search result. The search result template is a template file constructed in advance according to the output requirements for actual search results.
Acquiring the data push mode corresponding to the user.
In this embodiment, the data push mode may be the push mode preferred by the user, such as mail push, interface display or SMS push.
And pushing the search result to the user based on the data push mode.
In this embodiment, the generated search result may be pushed to the user in the obtained data push mode.
A corresponding search result is thus generated based on the target segment position, the data push mode corresponding to the user is acquired, and the search result is then pushed to the user in that mode. By automatically generating the search result from the target segment position once it has been produced from the prediction results, and then pushing it according to the user's preferred data push mode, the user receives the returned search result in time, which makes search-result pushing more intelligent and improves the user experience.
In some alternative implementations, the user information is obtained with the user's consent and in compliance with the relevant laws and policies.
In addition, any third-party software tools or components appearing in the embodiments of the present application are presented by way of example only and do not represent actual use.
In addition, the video moment retrieval based on the stacked pointer network provided by the application unifies the moment retrieval task with the relation extraction task; the only difference is that the input for relation extraction is text while the input for moment retrieval is frames. Decomposing the localization problem into a primary localization and a secondary prediction greatly reduces the learning difficulty: the retrieval task is split into two steps, the first step preliminarily predicting the start position of a segment and the second step precisely predicting the offset of that position, which makes the task easier to learn.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and the sequence numbers should not constitute any limitation on the implementation of the embodiments of the present invention.
It should be emphasized that, to further ensure the privacy and security of the target segment locations, the target segment locations may also be stored in a blockchain node.
Blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association using cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity (anti-counterfeiting) of its information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
The embodiments of the application may acquire and process the relevant data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results.
The basic technologies of artificial intelligence generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems and mechatronics. Artificial intelligence software technologies mainly include computer vision, robotics, biometric recognition, speech processing, natural language processing, and machine learning/deep learning.
Those skilled in the art will appreciate that implementing all or part of the methods described above may be accomplished by computer-readable instructions stored in a computer-readable storage medium which, when executed, may include the procedures of the embodiments of the methods described above. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc or a read-only memory (ROM), or a random access memory (RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown sequentially as indicated by the arrows, these steps are not necessarily performed in the order indicated. Unless explicitly stated herein, the order of these steps is not strictly limited and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and whose order of execution is not necessarily sequential; they may be performed in turn or alternately with other steps, or with at least part of the sub-steps or stages of other steps.
With further reference to fig. 3, as an implementation of the method shown in fig. 2, the present application provides an embodiment of an artificial intelligence-based data retrieval apparatus, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable to various electronic devices.
As shown in fig. 3, the artificial intelligence based data retrieval apparatus 300 according to the present embodiment includes a first acquisition module 301, a preprocessing module 302, an extraction module 303, a second processing module 304, a first prediction module 305, a second prediction module 306, and a first generation module 307. Wherein:
A first obtaining module 301, configured to obtain a search text corresponding to a target video input by a user;
The preprocessing module 302 is configured to perform text preprocessing on the search text to obtain a corresponding text embedded representation;
the extracting module 303 is configured to extract image features of the target video to obtain corresponding image features;
the second processing module 304 is configured to invoke a preset stacked pointer network, and perform cross attention processing on the text embedded representation and the image features based on a cross attention layer in the stacked pointer network, so as to obtain feature representations of a plurality of frames;
A first prediction module 305, configured to perform prediction processing on the feature representations of the multiple frames to obtain corresponding segment start positions, and determine corresponding segment start intervals based on the segment start positions;
the second prediction module 306 is configured to perform offset prediction processing on the segment start interval based on the text embedded representation, so as to obtain a corresponding prediction result;
a first generating module 307, configured to generate a target segment position corresponding to the search text based on the prediction result.
In this embodiment, the operations performed by the modules or units respectively correspond to the steps of the artificial intelligence-based data retrieval method in the foregoing embodiment one by one, and are not described herein again.
In some alternative implementations of the present embodiment, the preprocessing module 302 includes:
the word segmentation sub-module is used for segmenting the search text based on a preset word segmentation device to obtain a corresponding word segmentation result;
The first calling sub-module is used for calling a preset word embedding model;
The first conversion sub-module is used for carrying out conversion processing on the word segmentation result based on the word embedding model to obtain a corresponding embedded representation;
a first determination submodule for taking the embedded representation as the text embedded representation.
In this embodiment, the operations performed by the modules or units respectively correspond to the steps of the artificial intelligence-based data retrieval method in the foregoing embodiment one by one, and are not described herein again.
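As a concrete illustration of the word segmentation and embedding steps above, the following is a minimal Python sketch in which a Hugging Face tokenizer and encoder stand in for the preset word segmentation device and word embedding model; the specific model name (bert-base-chinese) is an assumption for illustration only, not a detail of the present disclosure.

import torch
from transformers import AutoTokenizer, AutoModel

# Assumed stand-ins for the preset tokenizer and word embedding model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
embedder = AutoModel.from_pretrained("bert-base-chinese")

def preprocess_text(search_text: str) -> torch.Tensor:
    # Word segmentation: split the search text into tokens.
    inputs = tokenizer(search_text, return_tensors="pt", truncation=True)
    # Conversion: map the word segmentation result to an embedded representation.
    with torch.no_grad():
        outputs = embedder(**inputs)
    return outputs.last_hidden_state  # shape: (1, num_tokens, hidden_dim)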
In some alternative implementations of the present embodiment, the extracting module 303 includes:
The sampling sub-module is used for sampling the frames of the target video to obtain corresponding image frames;
the second calling sub-module is used for calling a preset image encoder;
The extraction sub-module is used for extracting the characteristics of the image frames based on the image encoder to obtain corresponding characteristic vectors;
and the second determination submodule is used for taking the characteristic vector as the image characteristic.
In this embodiment, the operations performed by the modules or units respectively correspond to the steps of the artificial intelligence-based data retrieval method in the foregoing embodiment one by one, and are not described herein again.
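For concreteness, the sketch below shows one plausible way to implement the frame sampling step with OpenCV; the uniform sampling strategy and the frame count are assumptions. The sampled frames would then be fed to the preset image encoder (for example, a CLIP-style vision model) to obtain one feature vector per frame.

import cv2
import numpy as np

def sample_frames(video_path: str, num_frames: int = 32) -> np.ndarray:
    # Uniformly sample a fixed number of frames from the target video.
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames)  # shape: (num_frames, height, width, 3)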
In some alternative implementations of the present embodiment, the stacked pointer network further includes a linear layer, and the first prediction module 305 includes:
the prediction sub-module is used for performing start-position prediction on the feature representations of the frames based on the linear layer, to obtain a probability value for each frame;
The first acquisition submodule is used for acquiring a preset probability threshold value;
the first screening sub-module is used for screening, from the frames, designated frames whose probability values are greater than the probability threshold, and taking the designated frames as the segment start positions;
and the third determination submodule is used for determining a corresponding segment start interval based on the segment start positions.
In this embodiment, the operations performed by the modules or units respectively correspond to the steps of the artificial intelligence-based data retrieval method in the foregoing embodiment one by one, and are not described herein again.
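A minimal PyTorch sketch of the start-position prediction step is given below; the feature size, the sigmoid activation, and the rule that groups consecutive selected frames into a start interval are all assumptions, since the disclosure does not fix these details.

import torch
import torch.nn as nn

hidden_dim = 512  # assumed feature size of the per-frame representations
start_head = nn.Linear(hidden_dim, 1)  # the linear layer of the stacked pointer network

def predict_start_intervals(frame_feats: torch.Tensor, threshold: float = 0.5):
    # frame_feats: (num_frames, hidden_dim) feature representations of the frames.
    probs = torch.sigmoid(start_head(frame_feats)).squeeze(-1)  # per-frame probability
    selected = (probs > threshold).nonzero(as_tuple=True)[0].tolist()
    # Group consecutive designated frames into candidate segment start intervals.
    intervals, run = [], []
    for i in selected:
        if run and i != run[-1] + 1:
            intervals.append((run[0], run[-1]))
            run = []
        run.append(i)
    if run:
        intervals.append((run[0], run[-1]))
    return probs, intervals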
In some alternative implementations of the present embodiment, the stacked pointer network further includes a fully connected layer, and the second prediction module 306 includes:
the second acquisition sub-module is used for extracting first feature representations of all target frames contained in the segment start interval;
the processing sub-module is used for performing cross attention processing on the first feature representation and the text embedded representation to obtain a corresponding second feature representation;
the aggregation sub-module is used for performing aggregation processing on the second feature representation based on a preset pooling strategy to obtain a corresponding representation vector;
the mapping sub-module is used for mapping the representation vector to an offset prediction task based on the fully connected layer to obtain a corresponding prediction confidence and prediction offset value;
and a fourth determining sub-module, configured to use the prediction confidence and the prediction offset value as the prediction result corresponding to the segment start interval.
In this embodiment, the operations performed by the modules or units respectively correspond to the steps of the artificial intelligence-based data retrieval method in the foregoing embodiment one by one, and are not described herein again.
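The offset prediction step might be sketched as follows in PyTorch; the mean pooling strategy, the single scalar offset per interval, and the attention configuration are illustrative assumptions.

import torch
import torch.nn as nn

hidden_dim = 512  # assumed feature size, matching the sketch above
cross_attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
offset_head = nn.Linear(hidden_dim, 2)  # fully connected layer -> (confidence logit, offset)

def predict_offset(interval_feats: torch.Tensor, text_emb: torch.Tensor):
    # interval_feats: (1, num_target_frames, hidden_dim) first feature representations.
    # text_emb:       (1, num_tokens, hidden_dim) text embedded representation.
    attended, _ = cross_attn(interval_feats, text_emb, text_emb)  # second feature representation
    pooled = attended.mean(dim=1)              # aggregation via a mean pooling strategy
    confidence_logit, offset = offset_head(pooled).unbind(-1)
    return torch.sigmoid(confidence_logit), offset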
In some alternative implementations of the present embodiment, the first generating module 307 includes:
the second screening submodule is used for screening out, from all the segment start intervals, a first segment start interval whose prediction confidence is greater than a preset confidence threshold;
a third obtaining sub-module, configured to obtain a specified prediction offset value corresponding to the first segment start interval;
the fine adjustment sub-module is used for performing position fine-tuning on the first segment start interval based on the specified prediction offset value to obtain a fine-tuned second segment start interval;
the second conversion sub-module is used for converting the second segment start interval based on a preset target conversion form to obtain a processed third segment start interval;
and the generation sub-module is used for generating a target segment position corresponding to the search text based on the third segment start interval.
In this embodiment, the operations performed by the modules or units respectively correspond to the steps of the artificial intelligence-based data retrieval method in the foregoing embodiment one by one, and are not described herein again.
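The generation of the target segment position can then be sketched as plain post-processing; the confidence threshold and the seconds-based timestamp format (as one possible target conversion form) are assumptions for illustration.

def generate_target_segment(intervals, confidences, offsets, fps: float,
                            conf_threshold: float = 0.7):
    # Keep intervals whose prediction confidence exceeds the threshold, fine-tune
    # their positions with the predicted offset, and convert frame indices to seconds.
    results = []
    for (start, end), conf, off in zip(intervals, confidences, offsets):
        if conf > conf_threshold:
            start_f = max(0.0, start + off)   # position fine-tuning by the offset
            end_f = max(start_f, end + off)
            results.append({"start_sec": start_f / fps,
                            "end_sec": end_f / fps,
                            "confidence": float(conf)})
    return results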
In some optional implementations of the present embodiment, the artificial intelligence-based data retrieval apparatus further includes:
the second generation module is used for generating a corresponding search result based on the target segment position;
the second acquisition module is used for acquiring a data pushing mode corresponding to the user;
and the pushing module is used for pushing the search result to the user based on the data pushing mode.
In this embodiment, the operations performed by the modules or units respectively correspond to the steps of the artificial intelligence-based data retrieval method in the foregoing embodiment one by one, and are not described herein again.
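A small stub of the result pushing step is shown below; the push modes and the print-based delivery are purely illustrative stand-ins for whatever channels a deployment actually supports.

def format_search_results(results) -> str:
    # Render the target segment positions as a human-readable search result.
    return "\n".join(f"{r['start_sec']:.1f}s-{r['end_sec']:.1f}s "
                     f"(confidence {r['confidence']:.2f})" for r in results)

def push_results(push_mode: str, contact: str, results) -> None:
    # Dispatch the search result via the user's data push mode (stubbed here).
    message = format_search_results(results)
    if push_mode == "email":
        print(f"[email to {contact}] {message}")
    elif push_mode == "sms":
        print(f"[sms to {contact}] {message}")
    else:
        print(f"[in-app for {contact}] {message}")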
In order to solve the above technical problems, an embodiment of the application further provides a computer device. Referring specifically to fig. 4, fig. 4 is a basic structural block diagram of the computer device according to this embodiment.
The computer device 4 comprises a memory 41, a processor 42, and a network interface 43, which are communicatively connected to one another via a system bus. It should be noted that only a computer device 4 having components 41-43 is shown in the figures, but it should be understood that not all of the illustrated components are required to be implemented, and more or fewer components may be implemented instead. Those skilled in the art will appreciate that the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device may be a desktop computer, a notebook computer, a palmtop computer, a cloud server, or other computing device. The computer device may perform human-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch pad, a voice control device, or the like.
The memory 41 includes at least one type of readable storage medium, including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read-Only Memory (ROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Programmable Read-Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the computer device 4. Of course, the memory 41 may also comprise both an internal storage unit of the computer device 4 and an external storage device. In this embodiment, the memory 41 is typically used to store the operating system and various application software installed on the computer device 4, such as computer-readable instructions of the artificial intelligence-based data retrieval method. Further, the memory 41 may be used to temporarily store various types of data that have been output or are to be output.
The processor 42 may be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to execute computer readable instructions stored in the memory 41 or process data, such as executing computer readable instructions of the artificial intelligence based data retrieval method.
The network interface 43 may comprise a wireless network interface or a wired network interface, which network interface 43 is typically used for establishing a communication connection between the computer device 4 and other electronic devices.
Compared with the prior art, the embodiment of the application has the following main beneficial effects:
In the embodiment of the application, after the search text corresponding to the target video input by the user is acquired, text preprocessing is performed on the search text to obtain a text embedded representation, and image feature extraction is performed on the target video to obtain image features. Cross attention processing is then performed on the text embedded representation and the image features based on a stacked pointer network to obtain the feature representations of a plurality of frames. The feature representations of the frames are further processed: the segment start positions are first predicted coarsely, and offset prediction is then performed for refinement. Based on the obtained prediction results, the target segment position corresponding to the input search text can be generated quickly and accurately, which improves the processing efficiency of video data retrieval and ensures the accuracy of the retrieved target segment position.
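To make the flow above concrete, the following sketch composes the helper functions introduced in the earlier sketches into one end-to-end call; encode_frames and cross_attend are additional assumed helpers standing in for the preset image encoder and the cross attention layer of the stacked pointer network.

def retrieve_segment(search_text: str, video_path: str, fps: float,
                     prob_threshold: float = 0.5, conf_threshold: float = 0.7):
    text_emb = preprocess_text(search_text)            # text preprocessing
    frames = sample_frames(video_path)                 # frame sampling
    image_feats = encode_frames(frames)                # assumed image encoder helper
    frame_feats = cross_attend(text_emb, image_feats)  # assumed cross attention helper
    probs, intervals = predict_start_intervals(frame_feats, prob_threshold)
    confidences, offsets = [], []
    for (s, e) in intervals:
        conf, off = predict_offset(frame_feats[s:e + 1].unsqueeze(0), text_emb)
        confidences.append(float(conf))
        offsets.append(float(off))
    return generate_target_segment(intervals, confidences, offsets, fps, conf_threshold)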
The present application also provides another embodiment, namely, a computer-readable storage medium storing computer-readable instructions executable by at least one processor to cause the at least one processor to perform the steps of the artificial intelligence-based data retrieval method as described above.
The beneficial effects of this embodiment are the same as those of the foregoing embodiment and are not described herein again.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or by hardware alone, although in many cases the former is preferred. Based on such an understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disk) and comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods of the embodiments of the present application.
The above-described embodiments are only some, not all, of the embodiments of the present application, and the preferred embodiments of the present application are shown in the accompanying drawings, which do not limit the scope of the claims. This application may be embodied in many different forms; these embodiments are provided so that the disclosure of the present application will be thorough and complete. Although the application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or make equivalent substitutions for some of the technical features thereof. Any equivalent structure made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, likewise falls within the scope of protection of the present application.

Claims (10)

1. An artificial intelligence-based data retrieval method, characterized by comprising the following steps:
acquiring a search text corresponding to a target video, which is input by a user;
performing text preprocessing on the search text to obtain a corresponding text embedded representation;
extracting image features of the target video to obtain corresponding image features;
calling a preset stacked pointer network, and performing cross attention processing on the text embedded representation and the image features based on a cross attention layer in the stacked pointer network to obtain feature representations of a plurality of frames;
predicting the feature representations of the frames to obtain corresponding segment start positions, and determining corresponding segment start intervals based on the segment start positions;
performing offset prediction processing on the segment start interval based on the text embedded representation to obtain a corresponding prediction result;
and generating a target segment position corresponding to the search text based on the prediction result.
2. The artificial intelligence-based data retrieval method according to claim 1, wherein the step of performing text preprocessing on the search text to obtain a corresponding text embedded representation specifically comprises:
word segmentation is carried out on the search text based on a preset word segmentation device, and a corresponding word segmentation result is obtained;
Calling a preset word embedding model;
converting the word segmentation result based on the word embedding model to obtain a corresponding embedded representation;
the embedded representation is taken as the text embedded representation.
3. The artificial intelligence-based data retrieval method according to claim 1, wherein the step of extracting image features from the target video to obtain corresponding image features specifically comprises:
performing frame sampling on the target video to obtain a corresponding image frame;
calling a preset image encoder;
extracting features of the image frames based on the image encoder to obtain corresponding feature vectors;
and taking the characteristic vector as the image characteristic.
4. The artificial intelligence-based data retrieval method according to claim 1, wherein the stacked pointer network further comprises a linear layer, and the step of predicting the feature representations of the plurality of frames to obtain corresponding segment start positions and determining corresponding segment start intervals based on the segment start positions specifically comprises:
performing start-position prediction on the feature representations of the frames based on the linear layer to obtain a probability value for each frame;
acquiring a preset probability threshold;
screening, from the frames, designated frames whose probability values are greater than the probability threshold, and taking the designated frames as the segment start positions;
and determining a corresponding segment start interval based on the segment start positions.
5. The artificial intelligence-based data retrieval method according to claim 1, wherein the stacked pointer network further comprises a fully connected layer, and the step of performing offset prediction processing on the segment start interval based on the text embedded representation to obtain a corresponding prediction result specifically comprises:
extracting first feature representations of all target frames contained in the segment start interval;
performing cross attention processing on the first feature representation and the text embedded representation to obtain a corresponding second feature representation;
performing aggregation processing on the second feature representation based on a preset pooling strategy to obtain a corresponding representation vector;
mapping the representation vector to an offset prediction task based on the fully connected layer to obtain a corresponding prediction confidence and prediction offset value;
and taking the prediction confidence and the prediction offset value as the prediction result corresponding to the segment start interval.
6. The artificial intelligence-based data retrieval method according to claim 5, wherein the step of generating the target segment position corresponding to the search text based on the prediction result specifically comprises:
screening out, from all the segment start intervals, a first segment start interval whose prediction confidence is greater than a preset confidence threshold;
acquiring a specified prediction offset value corresponding to the first segment start interval;
performing position fine-tuning on the first segment start interval based on the specified prediction offset value to obtain a fine-tuned second segment start interval;
converting the second segment start interval based on a preset target conversion form to obtain a processed third segment start interval;
and generating a target segment position corresponding to the search text based on the third segment start interval.
7. The artificial intelligence-based data retrieval method according to claim 1, further comprising, after the step of generating a target segment position corresponding to the search text based on the prediction result:
generating a corresponding search result based on the target segment position;
acquiring a data pushing mode corresponding to the user;
and pushing the search result to the user based on the data pushing mode.
8. An artificial intelligence-based data retrieval apparatus, characterized by comprising:
The first acquisition module is used for acquiring search texts corresponding to the target videos and input by a user;
The preprocessing module is used for preprocessing the text of the search text to obtain a corresponding text embedded representation;
the extraction module is used for extracting image features of the target video to obtain corresponding image features;
the second processing module is used for calling a preset stacked pointer network, and performing cross attention processing on the text embedded representation and the image features based on a cross attention layer in the stacked pointer network to obtain feature representations of a plurality of frames;
the first prediction module is used for performing prediction processing on the feature representations of the frames to obtain corresponding segment start positions, and determining corresponding segment start intervals based on the segment start positions;
the second prediction module is used for performing offset prediction processing on the segment start interval based on the text embedded representation to obtain a corresponding prediction result;
and the first generation module is used for generating a target segment position corresponding to the search text based on the prediction result.
9. A computer device, comprising a memory and a processor, the memory having stored therein computer readable instructions which, when executed by the processor, implement the steps of the artificial intelligence-based data retrieval method according to any one of claims 1 to 7.
10. A computer readable storage medium having stored thereon computer readable instructions which, when executed by a processor, implement the steps of the artificial intelligence-based data retrieval method according to any one of claims 1 to 7.
CN202411630356.3A 2024-11-14 2024-11-14 Data retrieval method, device, computer equipment and medium based on artificial intelligence Pending CN119577186A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411630356.3A CN119577186A (en) 2024-11-14 2024-11-14 Data retrieval method, device, computer equipment and medium based on artificial intelligence

Publications (1)

Publication Number Publication Date
CN119577186A (en) 2025-03-07

Family

ID=94800858

Country Status (1)

Country Link
CN (1) CN119577186A (en)

Similar Documents

Publication Publication Date Title
CN117312535B (en) Method, device, equipment and medium for processing problem data based on artificial intelligence
CN112395390B (en) Training corpus generation method of intention recognition model and related equipment thereof
CN112199954B (en) Disease entity matching method and device based on voice semantics and computer equipment
CN114359582B (en) Small sample feature extraction method based on neural network and related equipment
CN117874234A (en) Text classification method and device based on semantics, computer equipment and storage medium
CN118070072A (en) Problem processing method, device, equipment and storage medium based on artificial intelligence
CN116821298A (en) Keyword automatic identification method applied to application information and related equipment
CN116450943A (en) Artificial intelligence-based speaking recommendation method, device, equipment and storage medium
CN119577148A (en) A text classification method, device, computer equipment and storage medium
CN118964649A (en) Artificial intelligence-based manuscript generation method, device, computer equipment and medium
CN119166763A (en) Query processing method, device, computer equipment and medium based on artificial intelligence
CN113609833B (en) Dynamic file generation method and device, computer equipment and storage medium
CN116166858A (en) Information recommendation method, device, equipment and storage medium based on artificial intelligence
CN116502624A (en) Corpus expansion method and device, computer equipment and storage medium
CN116738948A (en) Data processing method, device, computer equipment and storage medium
CN116702928A (en) Model performance improving method and device, computer equipment and storage medium
CN119577186A (en) Data retrieval method, device, computer equipment and medium based on artificial intelligence
CN115544282A (en) Data processing method, device and equipment based on graph database and storage medium
CN113111181A (en) Text data processing method and device, electronic equipment and storage medium
CN112069807A (en) Text data theme extraction method and device, computer equipment and storage medium
CN119864015B (en) Audio generation method, device, equipment and storage medium thereof
CN115952295B (en) Image-based entity relationship annotation model processing method and related equipment
CN117076775A (en) Information data processing method, information data processing device, computer equipment and storage medium
CN117271790A (en) Method and device for expanding annotation data, computer equipment and storage medium
CN119398014A (en) A dynamic form generation method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination