CN115203445A - Multimedia resource searching method, device, equipment and medium - Google Patents

Info

Publication number: CN115203445A
Authority: CN (China)
Prior art keywords: text, keyword, word, preset, index table
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Application number: CN202210855628.4A
Other languages: Chinese (zh)
Inventors: 朱运, 乔建秀
Current Assignee: Ping An Technology Shenzhen Co Ltd (the listed assignees may be inaccurate)
Original Assignee: Ping An Technology Shenzhen Co Ltd
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202210855628.4A
Publication of CN115203445A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/48 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3346 Query execution using probabilistic model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/41 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The invention relates to the technical field of artificial intelligence and provides a multimedia resource searching method, device, equipment and medium. The method extracts text content from texts to obtain text segments, stores the text segments in a preset database, and performs word segmentation on the text segments to obtain first keywords; constructs an inverted index table from the first keywords and stores the classification labels of the text segments in the inverted index table to construct a multimedia library; extracts a second keyword from a query request, searches the multimedia library, according to the inverted index table and the second keyword, for the classification label of the first keyword associated with the second keyword, and reads the text segments from the preset database according to the classification label; and scores the similarity among the text segments, sorts the scores, selects text segments according to the sorted order, renders them into the corresponding texts, and outputs the texts to a user side. The invention also relates to the technical field of blockchain: the first keyword and the second keyword can also be stored in a node of a blockchain.

Description

Multimedia resource searching method, device, equipment and medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a multimedia resource searching method, device, equipment and medium.
Background
With the rapid development of the internet, multimedia resource search has become an important topic. Conventionally, a multimedia resource search builds a separate content library for each type of text (web page text, PDF text, picture text, video text). When a user enters keywords, the back end searches each content library separately and returns every text of every type associated with the keywords. The user then has to switch back and forth between text types in the display interface and spend time picking out the wanted text segments from each text. This consumes the user's time, and because the screening is done manually, the accuracy of the text segments found is low.
Disclosure of Invention
In view of the above, the present invention provides a method, an apparatus, a device and a medium for searching multimedia resources, and aims to solve the technical problems of low efficiency and low accuracy in searching various different types of text segments in the prior art.
In order to achieve the above object, the present invention provides a multimedia resource searching method, which comprises:
respectively extracting character contents from various different types of texts to obtain one or more text segments, storing the text segments in a preset database, and segmenting each text segment to obtain a first keyword of each text segment;
constructing an inverted index table of word search according to the first keyword, and storing the classification label of each text segment to the inverted index table to construct a multimedia library;
receiving a query request sent by a user side, extracting a second keyword from the query request, searching a classification label of a text segment corresponding to a first keyword associated with the second keyword in the multimedia library according to the inverted index table and the second keyword, and reading a corresponding text segment from the preset database according to the classification label obtained by retrieval;
and scoring the similarity among the text fragments, sorting the obtained scoring values according to a preset sorting sequence, selecting a preset number of text fragments according to the sorting sequence, rendering the text fragments into corresponding texts, and outputting the texts to the user side.
Preferably, the extracting text content from the texts of different types to obtain one or more text segments and storing the one or more text segments in a preset database includes:
dividing each type of text into a format part and a text content part, and performing segment division on the text content part to obtain one or more text segments and storing the text segments in a preset database.
Preferably, the segmenting each text segment to obtain the first keyword of each text segment includes:
dividing the long text sentence of each text segment according to a preset word segmentation algorithm to obtain a plurality of word groups;
and calculating the similarity value between adjacent phrases, and taking the phrase with the similarity value smaller than a preset threshold value as a first keyword.
Preferably, after the constructing the inverted index table of the word search according to the first keyword, the method further includes:
counting word frequency values of the first keywords appearing in the corresponding text segments;
comparing the word frequency value with a preset word frequency value, and if the word frequency value is greater than or equal to the preset word frequency value, filling the first keyword into a high-frequency word queue in the inverted index table;
and if the word frequency value is smaller than a preset word frequency value, filling the first keyword into a low-frequency word queue in the inverted index table.
Preferably, before the storing the classification label of each text segment to the inverted index table to construct a multimedia library, the method further includes:
reading a text sequence of a first keyword of each text segment, inputting the text sequence into a preset classification model for marking and embedding to obtain word vector characteristics;
matching the classification labels of the text segments from the label modules of the preset classification models according to the word vector characteristics, and establishing a mapping relation between the classification labels and the first keywords of the text segments.
Preferably, the extracting the second keyword from the query request includes:
performing word segmentation on the information of the query request to obtain a plurality of word segments;
and generating a dictionary tree according to a pre-constructed dictionary word list, and inputting the plurality of word segments into the dictionary tree for traversal to obtain the second keyword.
Preferably, the searching, according to the inverted index table and the second keyword, a classification tag of a text segment corresponding to a first keyword associated with the second keyword in the multimedia library, and reading a corresponding text segment from the preset database according to the retrieved classification tag includes:
inputting the second keyword into a search engine of the inverted index table;
traversing the first keywords in the inverted index table according to the search engine to obtain first keywords related to the second keywords;
and reading the classification label of the associated first keyword according to the mapping relation, and reading the corresponding text segment from the preset database according to the retrieved classification label.
In order to achieve the above object, the present invention further provides a multimedia resource search apparatus, comprising:
an extraction module, configured to extract text content from various different types of texts respectively to obtain one or more text segments, store the text segments in a preset database, and perform word segmentation on each text segment to obtain a first keyword of each text segment;
a storage module, configured to construct an inverted index table for word search according to the first keyword, and store the classification label of each text segment in the inverted index table to construct a multimedia library;
a query module, configured to receive a query request sent by a user side, extract a second keyword from the query request, search the multimedia library, according to the inverted index table and the second keyword, for the classification label of the text segment corresponding to a first keyword associated with the second keyword, and read the corresponding text segment from the preset database according to the retrieved classification label;
an output module, configured to score the similarity among the text segments, sort the obtained scores according to a preset sorting order, select a preset number of text segments according to the sorted order, render the text segments into the corresponding texts, and output the texts to the user side.
To achieve the above object, the present invention also provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a program executable by the at least one processor, the program being executed by the at least one processor to enable the at least one processor to perform the multimedia asset searching method according to any one of claims 1 to 7.
To achieve the above object, the present invention further provides a computer readable medium storing a multimedia resource search program which, when executed by a processor, implements the steps of the multimedia resource searching method according to any one of claims 1 to 7.
The method extracts the first keywords and the text segments from texts of different types, constructs an inverted index table for word search from all the first keywords, and stores the classification labels of all the text segments in the inverted index table to construct the multimedia library. The content of texts of different types is thus searched under a unified index architecture, which reduces the cost of building multiple content libraries and the search time.
According to the inverted index table and the second keywords queried by the user, the multimedia library is searched to obtain the text segments of the first keywords associated with the second keywords. The similarity of the text segments is scored and sorted, and the top-ranked text segments are selected, rendered into the corresponding texts, and output to the user side. Text segments thus serve as the search results, and texts of different types are displayed together in one display interface, which reduces manual operation by the user and improves search accuracy and efficiency.
Drawings
FIG. 1 is a schematic flow chart diagram illustrating a preferred embodiment of a multimedia resource searching method according to the present invention;
FIG. 2 is a block diagram of a multimedia resource searching apparatus according to a preferred embodiment of the present invention;
FIG. 3 is a diagram of an electronic device according to a preferred embodiment of the present invention;
The objects, features and advantages of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention can acquire and process related data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use that knowledge to obtain the best result.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
The invention provides a multimedia resource searching method. Fig. 1 is a schematic method flow diagram of an embodiment of the multimedia resource searching method of the present invention. The method may be performed by an electronic device, which may be implemented by software and/or hardware. The multimedia resource searching method comprises the following steps S10-S40:
Step S10: extract text content from various different types of texts respectively to obtain one or more text segments, store the text segments in a preset database, and perform word segmentation on each text segment to obtain the first keywords of each text segment.
In this embodiment, the different types of text include, but are not limited to, web page text, PDF text, picture text, and video text. The method of extracting text content differs by text type. Because the extracted text content is long, it is divided into at least one text segment according to punctuation marks (periods, exclamation marks, semicolons) or paragraphs in the text content. A text segment corresponds to a pause in the text caused by a turn, an emphasis or a break in the expression of its ideas, commonly called a "natural paragraph". Each text segment is then segmented into words, and the words that carry the important meaning and semantics of the segment are taken as its first keywords. The first keywords are one of the main means of searching and indexing the multimedia library, and are also the specific names and terms of an enterprise's products, services and the like that a user wants to know about.
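The punctuation-and-paragraph division described above can be illustrated with a minimal Python sketch. The function name `split_into_segments` and the exact set of delimiter characters are assumptions made for illustration; the patent only names periods, exclamation marks, semicolons and paragraphs.

```python
import re

# Delimiters named in the text (period, exclamation mark, semicolon), in ASCII
# and full-width Chinese forms, plus paragraph breaks. The exact character set
# is an assumption; the source does not enumerate it.
_SEGMENT_BOUNDARY = re.compile(r"(?<=[.!;。！；])\s*|\n+")


def split_into_segments(content: str) -> list[str]:
    """Split extracted text content into text segments.

    A boundary is placed right after each delimiter (the delimiter stays with
    the segment it ends) and at every run of newlines.
    """
    parts = _SEGMENT_BOUNDARY.split(content)
    return [part.strip() for part in parts if part and part.strip()]
```

A real system would need extra rules, for instance so that a decimal point inside a number is not treated as a sentence boundary.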
In one embodiment, the extracting text content from the different types of texts to obtain one or more text segments and storing the one or more text segments in a preset database includes:
dividing each type of text into a format part and a text content part, and performing segment division on the text content part to obtain one or more text segments and storing the text segments in a preset database.
In one embodiment, the formats of the different types of text are: the HTML code of the web page text, the coordinate information of the text content of the PDF text, the coordinate information of the text content of the picture text, and the starting time period of the video text on the playing time axis.
Dividing each type of text into a format part and a text content part is the basis for searching text segments within the texts. It is also a precondition for displaying different types of texts together on the user interface and for reducing the time the user spends screening for the wanted text segments in each text.
Dividing each type of text into a format part and a text content part, and specifically comprising the following steps:
Web page text: the HTML code part and the text content part of the web page text are separated. For example, when a web page (e.g., web address: https://www.163.com/dy/arrow…) is opened and the "view page source" item of the right-click menu is clicked, the page shows HTML code and text content mixed together, e.g., "<title>Quick look: the final complete form of the Chinese space station! Manned spacecraft, astronaut, Shenzhou, rocket, NetEase subscription</title>… HTML …". The HTML code (format) "<title></title>" and the text content "Quick look: the final complete form of the Chinese space station! Manned spacecraft, astronaut, Shenzhou, rocket, NetEase subscription" are read and separated, the text content is divided into a set containing at least one text segment according to its title and paragraphs, and the HTML code (format) and the set of text segments are stored separately in a preset database.
PDF text: the text content part of the PDF text and the coordinate information part of the text content are extracted by an OCR (optical character recognition) algorithm, the text content is divided into a set containing at least one text segment according to its titles and paragraphs, and the coordinate information of the text content and the set of text segments are stored separately in a preset database. The coordinate information of the text content is the coordinate information of a line of text. It consists of four elements: the x-axis and y-axis coordinates of a vertex of the line's rectangular box, and the length and width of the box. OCR uses a trained, preset character recognition model to judge which regions of the PDF text may contain text, and then performs character recognition on those regions. For example, for a PDF text, the character recognition model first generates candidate rectangular boxes, estimates the likelihood that each box contains text, and then recognizes the text within the boxes.
Picture text: extracting a text content part of a picture text and a coordinate information part of the text content through an OCR (character recognition) algorithm, dividing the text content into a set containing at least one text segment according to a title and a paragraph of the text content, and respectively storing the coordinate information of the text content and the set of the text segment into a preset database.
Video text: the subtitles and speech in the video text are recognized by an ASR (automatic speech recognition) algorithm to extract the text content part, the text content is divided into a set containing at least one text segment according to the similarity of the subtitle keywords and/or the pauses in the speech, the starting time period of each text segment on the playing time axis is read, and the starting time periods and the set of text segments are stored in a preset database.
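For the web page case, the separation of the format part from the text content part can be sketched with Python's standard `html.parser` module. This is only an illustration under stated assumptions: the class name `FormatTextSplitter` and the function `split_html` are hypothetical, and the PDF, picture and video cases would additionally need OCR or ASR components that are not shown here.

```python
from html.parser import HTMLParser


class FormatTextSplitter(HTMLParser):
    """Collect markup (format part) and visible text (text content part) separately."""

    def __init__(self):
        super().__init__()
        self.format_parts = []  # tags in document order
        self.text_parts = []    # non-empty text nodes in document order

    def handle_starttag(self, tag, attrs):
        # Keep the start tag exactly as written, including its attributes.
        self.format_parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.format_parts.append(f"</{tag}>")

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())


def split_html(page: str):
    """Return (format_part, text_parts) for one web page text."""
    parser = FormatTextSplitter()
    parser.feed(page)
    parser.close()
    return "".join(parser.format_parts), parser.text_parts
```

The two return values correspond to what the description stores separately in the preset database: the HTML code as the format, and the text content that is subsequently divided into text segments.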
In an embodiment, the segmenting each text segment to obtain the first keyword of each text segment includes:
dividing the long text sentence of each text fragment according to a preset word segmentation algorithm to obtain a plurality of word groups;
and calculating the similarity value between adjacent phrases, and taking the phrase with the similarity value smaller than a preset threshold value as a first keyword.
The preset word segmentation algorithm includes, but is not limited to, a greedy algorithm and a blocking algorithm. The long sentences of each text segment are divided to obtain a word sequence vector containing the phrases produced by segmenting the text segment. The similarity value between every two adjacent phrases is calculated and compared with a preset threshold (for example, a preset threshold of 1), and the phrases whose similarity value is smaller than the preset threshold are taken as first keywords.
The first keywords extracted from each text segment represent its central theme and core idea, so the corresponding text segments can be screened out through the first keywords. The more relevant the extracted first keywords are, the higher the search efficiency and accuracy.
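The adjacent-phrase comparison can be sketched as follows. The patent does not state which similarity measure is used, nor which phrase of a low-similarity pair is kept, so this sketch makes two labeled assumptions: similarity is a character-level Jaccard coefficient (`char_jaccard`, hypothetical), and both phrases of a pair that falls below the threshold are kept. The default threshold of 0.5 is likewise illustrative.

```python
def char_jaccard(a: str, b: str) -> float:
    """Character-set Jaccard similarity in [0, 1] (an assumed stand-in measure)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def select_keywords(phrases, threshold=0.5, similarity=char_jaccard):
    """Keep phrases whose similarity to an adjacent phrase is below the threshold."""
    keywords = []
    for left, right in zip(phrases, phrases[1:]):
        if similarity(left, right) < threshold:
            # Assumption: both members of a dissimilar pair become keywords.
            for phrase in (left, right):
                if phrase not in keywords:
                    keywords.append(phrase)
    return keywords
```

Passing a different `similarity` callable (for example, one based on word vectors) changes the measure without touching the selection logic.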
Step S20: and constructing an inverted index table for word search according to the first keyword, and storing the classification label of each text segment to the inverted index table to construct a multimedia library.
In this embodiment, the inverted index table records, for each first keyword, the list of text segments that contain it. The classification labels of all the text segments are stored in the queues of the corresponding first keywords in the inverted index table to construct the multimedia library. By using the inverted index table, the multimedia library searches the content of different types of texts under a unified index architecture, which reduces the cost of building multiple content libraries and the search time.
In the set of text segments, many text segments contain the same first keyword. Each text segment is identified in the inverted index table by a document number (DocID), under which the information of each first keyword is recorded (for example, the sequence number and the sharing frequency of the first keyword in the inverted index table), together with information such as the number of times the first keyword occurs in the text segment (word frequency, TF) and the positions of the first keyword in the text segment. The information relating to one text segment forms an inverted index entry (posting), and the series of inverted index entries of all the first keywords forms the structure of the inverted index table.
The core of the inverted index table consists of two parts (a dictionary word list and an inverted list):
1. Dictionary word list: records all the first keywords as a list; the splitting granularity of the first keywords can be set according to specific requirements. The dictionary word list is generally large and can be implemented with a B+ tree or a hash table with chaining to support high-performance insertion and lookup as well as custom editing (e.g., deleting, adding and modifying a first keyword).
2. Inverted list: mainly records the relation between each first keyword and the corresponding text segments. The attributes of this relation are called an inverted index entry, which comprises the DocID, the word frequency (the number of times the first keyword appears in the text segment, which can be used to calculate relevance) and the position (the start and end positions) of the first keyword in the text segment.
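The two-part structure above (a dictionary of first keywords, each pointing to a list of inverted index entries) can be sketched in Python. The names `Posting` and `build_inverted_index` are hypothetical, a plain `dict` stands in for the B+ tree or hash table, and positions are found by simple substring search, which is an assumption rather than the patent's stated method.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Posting:
    """One inverted index entry: DocID, word frequency, and (start, end) positions."""
    doc_id: int
    term_frequency: int = 0
    positions: list = field(default_factory=list)


def build_inverted_index(segments: dict, keywords_by_doc: dict) -> dict:
    """Map each first keyword to the postings of the text segments containing it.

    segments:        {doc_id: text of the segment}
    keywords_by_doc: {doc_id: first keywords extracted from that segment}
    """
    index = defaultdict(list)
    for doc_id, text in segments.items():
        for keyword in keywords_by_doc.get(doc_id, []):
            if not keyword:
                continue  # an empty keyword would match everywhere
            positions = []
            start = text.find(keyword)
            while start != -1:
                positions.append((start, start + len(keyword)))
                start = text.find(keyword, start + len(keyword))
            if positions:
                index[keyword].append(Posting(doc_id, len(positions), positions))
    return dict(index)
```

Looking up a keyword in the returned dictionary yields its posting list directly, which is the property that lets one index serve text segments from every text type.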
In one embodiment, after the constructing the inverted index table of the word search according to the first keyword, the method further comprises:
counting word frequency values of the first keywords appearing in the corresponding text segments;
comparing the word frequency value with a preset word frequency value, and if the word frequency value is greater than or equal to the preset word frequency value, filling the first keyword into a high-frequency word queue in the inverted index table;
and if the word frequency value is smaller than a preset word frequency value, filling the first keyword into a low-frequency word queue in the inverted index table.
The word frequency of each first keyword can be counted with a programming model such as MapReduce. According to a preset word frequency value (for example, 3), first keywords whose word frequency is greater than or equal to the preset value are treated as high-frequency words and those below it as low-frequency words, and each is filled into the high-frequency or low-frequency word queue in the inverted index table. A separate index is generated for each of the two queues, which can improve the precision and speed of searching for first keywords and reduce the resources consumed by the search engine. For example, the high-frequency word queue may be built as an inverted index and the low-frequency word queue as a forward index, or the other way round, or both queues may be built as inverted indexes or as forward indexes; the choice is set according to the actual service scenario and is not limited here.
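The split into high-frequency and low-frequency queues can be sketched as below. This is a single-process illustration that uses `collections.Counter` in place of a MapReduce job; the function name `split_by_frequency` is hypothetical, and the default of 3 follows the example value given in the description.

```python
from collections import Counter


def split_by_frequency(keyword_occurrences, preset_frequency=3):
    """Return (high_frequency_queue, low_frequency_queue).

    keyword_occurrences lists every occurrence of a first keyword in a text
    segment, so a keyword that appears three times is listed three times.
    """
    counts = Counter(keyword_occurrences)
    high = [word for word, count in counts.items() if count >= preset_frequency]
    low = [word for word, count in counts.items() if count < preset_frequency]
    return high, low
```

The comparison is inclusive on the high side, matching the "greater than or equal to" condition in the method.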
In one embodiment, before storing the classification label of each text segment in the inverted index table to construct the multimedia library, the method further comprises:
reading a text sequence of a first keyword of each text segment, inputting the text sequence into a preset classification model for marking and embedding to obtain word vector characteristics;
matching the classification labels of the text segments from the label modules of the preset classification models according to the word vector characteristics, and establishing a mapping relation between the classification labels and the first keywords of the text segments.
The preset classification model is obtained by collecting and manually labeling a sample set of text segments containing different keywords and training a preset model (for example, BERT) on the sample set.
For example, the text sequence of each first keyword of a text segment A is read and input into the preset classification model for marking and embedding. An encoder represents the text sequence as a matrix and outputs the word vector feature of each first keyword. The word vector features are similarity-matched through a feature representation fusion layer and a fully connected layer, and the label module of the classification model outputs the label with the greatest similarity to the word vector features as the classification label of text segment A. A mapping relation is then established between the classification label and each first keyword of text segment A. With this mapping, a search only needs to determine the first keyword to find the corresponding text segments through the classification label. The text segments themselves do not have to be scanned for keywords; they only need to be stored in the preset database, which improves the operating speed of the inverted index table.
Step S30: receiving a query request sent by a user side, extracting a second keyword from the query request, searching a classification label of a text segment corresponding to a first keyword associated with the second keyword in the multimedia library according to the inverted index table and the second keyword, and reading the corresponding text segment from the preset database according to the classification label obtained by searching.
In this embodiment, a user enters the content of a query request on the interface of the search engine of the multimedia library (a search engine over the inverted index table) at the user side and clicks a "search" button. The search engine program then processes the content, for example by performing Chinese-specific word segmentation, removing stop words, determining whether to start an integrated search, and determining whether the content contains spelling errors or wrongly written characters. The query request is thereby parsed and the second keywords are extracted. After the second keywords are obtained, the search engine program starts matching: it finds in the inverted index table all first keywords whose semantics are the same as or similar to those of the second keywords, obtains the classification labels associated with those first keywords according to the mapping relation, and reads the text segments of those first keywords from the preset database according to the classification labels.
In one embodiment, the extracting the second keyword from the query request includes:
performing word segmentation on the information of the query request to obtain a plurality of word segments;
and generating a dictionary tree according to a pre-constructed dictionary word list, and inputting the plurality of word segments into the dictionary tree for traversal to obtain the second keyword.
The content of the query request is segmented based on a preset word segmentation algorithm (for example, the TextRank algorithm): the related words in the query request are extracted and stop words are removed, a correlation matrix of the words is constructed from the related words, and spelling errors or wrongly written characters in the content of the query request are corrected. An importance value is then obtained for each word through the formula of the algorithm, and a preset number of top-ranked words, in descending order of importance value, are selected as the word segments.
Based on all pre-recorded first keywords, a word list of first keywords is obtained by maximum word-prefix matching and used as the dictionary word list. A tree-structured dictionary tree is generated by taking the key value and character string of each first keyword in the dictionary word list as nodes, and the word frequencies of the word segments in the historical query requests of all users are counted in advance. The character prefix features of the word segments are then read, traversal matching starts from the root node of the dictionary tree, and the words at nodes of the dictionary tree whose character prefix features are the same as those of the word segments are taken as the second keywords. Matching the content of the query request against the dictionary word list in combination with users' historical search behavior to obtain the second keywords addresses the technical problem of wrongly written characters, grammatical errors and unclear expression in the content input by the user.
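The dictionary-tree traversal can be illustrated with a small prefix tree. This is a sketch under assumptions rather than the embodiment's implementation: each segmented query word is treated as a character prefix, and when several dictionary words share that prefix the one with the highest historical query frequency is chosen. That tie-breaking rule, the class names and the sample data are all hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None  # set when a dictionary word ends at this node


class Trie:
    def __init__(self, words):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.word = word

    def words_with_prefix(self, prefix):
        """Walk down from the root along `prefix`, then collect all words below."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        found, stack = [], [node]
        while stack:
            current = stack.pop()
            if current.word is not None:
                found.append(current.word)
            stack.extend(current.children.values())
        return sorted(found)


def second_keywords(query_segments, trie, history_freq):
    """Pick, for each query segment, the most frequently queried prefix match."""
    result = []
    for segment in query_segments:
        candidates = trie.words_with_prefix(segment)
        if candidates:
            best = max(candidates, key=lambda w: history_freq.get(w, 0))
            if best not in result:
                result.append(best)
    return result


# Hypothetical dictionary word list and historical query frequencies.
trie = Trie(["rocket", "rocketry", "shenzhou", "satellite"])
history_freq = {"rocket": 40, "rocketry": 3, "shenzhou": 25}
print(second_keywords(["rock", "shenz", "xyz"], trie, history_freq))
# prints ['rocket', 'shenzhou']
```

A truncated or partially mistyped query word such as "rock" still resolves to a recorded first keyword, which is the effect the paragraph above attributes to prefix matching.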
In one embodiment, the searching, according to the inverted index table and the second keyword, for a classification tag of a text segment corresponding to the first keyword associated with the second keyword in the multimedia library, and reading the corresponding text segment from the preset database according to the retrieved classification tag includes:
inputting the second keyword into a search engine of the inverted index table;
traversing the first keywords in the inverted index table according to the search engine to obtain first keywords related to the second keywords;
and reading the classification label of the associated first keyword according to the mapping relation, and reading the corresponding text segment from the preset database according to the retrieved classification label.
The second keyword is input into the search engine of the inverted index table, and the search engine obtains the first keywords associated with the second keyword by exploiting the different data-reading characteristics of the high-frequency word queue and the low-frequency word queue of the inverted index table. For example, the search engine traverses the high-frequency word queue in an inverted (reverse) indexing manner and traverses the low-frequency word queue in a forward indexing manner. An associated first keyword is a first keyword whose semantics are the same as or similar to those of the second keyword. Which indexing manner is used is set according to the actual service scenario and is not limited herein. The classification labels of the associated first keywords are then read according to the mapping relation established in step S20, and the text segments of those first keywords are obtained. In the prior art, using only a single indexing manner occupies more physical space of the search engine, and the index must be dynamically maintained whenever data in the inverted index table is added, deleted or modified, which slows down data maintenance. Adopting different indexing manners addresses these problems, effectively saving physical space and making data maintenance more convenient.
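The lookup can be sketched as below. The text does not define the two indexing manners in detail, so this sketch assumes their usual search-engine meaning: high-frequency keywords are resolved through a keyword-to-label dictionary (inverted lookup), while low-frequency keywords are found by scanning a label-to-keywords table (forward scan). The `related` table standing in for semantic matching, and all data structures and names, are hypothetical.

```python
def lookup_labels(second_keyword, high_freq_inverted, low_freq_forward, related):
    """Return classification labels of first keywords associated with the query."""
    wanted = {second_keyword} | related.get(second_keyword, set())
    labels = []
    # High-frequency words: inverted lookup, keyword -> labels.
    for keyword in sorted(wanted):
        for label in high_freq_inverted.get(keyword, []):
            if label not in labels:
                labels.append(label)
    # Low-frequency words: forward scan, label -> keywords.
    for label, keywords in low_freq_forward.items():
        if wanted & set(keywords) and label not in labels:
            labels.append(label)
    return labels


def read_segments(labels, database):
    """Read the stored text segments for each label from the preset database."""
    return [segment for label in labels for segment in database.get(label, [])]


# Hypothetical index contents and database.
high_freq_inverted = {"rocket": ["aerospace"], "bank": ["finance"]}
low_freq_forward = {"launch-log": ["shenzhou", "countdown"], "finance": ["ledger"]}
related = {"spaceship": {"shenzhou", "rocket"}}
database = {
    "aerospace": ["segment_a"],
    "launch-log": ["segment_b"],
    "finance": ["segment_c"],
}

labels = lookup_labels("spaceship", high_freq_inverted, low_freq_forward, related)
print(labels, read_segments(labels, database))
```

Only the smaller low-frequency table is scanned in full, while frequent keywords are resolved by direct dictionary access. This is one way to read the space and maintenance trade-off described above.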
Step S40: scoring the similarity among the text segments, sorting the obtained score values according to a preset sorting order, selecting a preset number of text segments according to the sorting order, rendering the selected text segments into corresponding texts, and outputting the texts to the user side.
In this embodiment, the text segments obtained for the first keywords may include text segments of different types, such as web page text, PDF text, picture text and video text. Similarity calculation is performed on the text segments, the similarity of each text segment is scored according to a preset scoring algorithm, and the obtained score values are sorted in a preset sorting order (for example, from high to low). A preset number (for example, 10) of top-ranked text segments is selected according to the sorting order, and the 10 text segments read from the preset database are rendered into texts of the corresponding formats and output to the user side.
For example, suppose the acquired first keywords are "Shenzhou" and "rocket" and the selected text segments are web page texts. All text segments related to the two first keywords are returned, the 10 top-ranked text segments are selected after calculation, the corresponding HTML code is acquired, and the original web page text is rendered at the user side and displayed to the user. If the selected text segments are PDF texts or picture texts, the corresponding text segments and their coordinate information are read for rendering. If the selected text segment is a video text, the corresponding text segment and the start time of its playback period on the time axis are read for rendering.
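The per-type rendering choice in this example can be sketched as a small dispatch function. The record layout is an assumption made for illustration: the field names `type`, `html`, `text`, `coordinates` and `start_seconds` are hypothetical and do not come from the original text.

```python
def render_payload(segment):
    """Select what the user side needs in order to render one text segment."""
    kind = segment["type"]
    if kind == "web":
        # Web page text: the stored HTML code is rendered as-is.
        return {"render_as": "html", "html": segment["html"]}
    if kind in ("pdf", "picture"):
        # PDF and picture text: the segment plus its coordinate information.
        return {
            "render_as": kind,
            "text": segment["text"],
            "coordinates": segment["coordinates"],
        }
    if kind == "video":
        # Video text: the segment plus the start time on the time axis.
        return {
            "render_as": "video",
            "text": segment["text"],
            "start_seconds": segment["start_seconds"],
        }
    raise ValueError(f"unsupported text type: {kind}")


selected = [
    {"type": "web", "html": "<p>Shenzhou rocket launch</p>"},
    {"type": "pdf", "text": "rocket stage", "coordinates": (72, 540, 300, 560)},
    {"type": "video", "text": "liftoff", "start_seconds": 95},
]
payloads = [render_payload(segment) for segment in selected]
print(payloads)
```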
The preset scoring algorithm is given by the following formula:
[Formula image BDA0003754423130000111 of the original publication, defining the score value T; not reproduced in this text]
where T is the score value of any text segment, n is the number of first keywords of the text segment, i indexes the i-th first keyword of the text segment, w_i is the IDF value of the i-th first keyword of the text segment, q is the second keyword queried by the user, d is the d-th public word of the text segment in the dictionary word list, and R(q, d) is the similarity between the text segment and the second keyword queried by the user.
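Because the formula itself survives only as an image reference, the following sketch does not claim to reproduce it. It assumes one plausible reading of the symbol definitions, a weighted sum of the form T = w_1 * R_1 + ... + w_n * R_n, where R_i is the similarity between the queried second keyword and the i-th first keyword. The smoothed IDF and the character-overlap similarity are simple stand-ins chosen for illustration, and all function names are hypothetical.

```python
import math


def idf(keyword, segments):
    """Smoothed inverse document frequency of a keyword over all segments."""
    total = len(segments)
    containing = sum(1 for keywords in segments.values() if keyword in keywords)
    return math.log((total + 1) / (containing + 1)) + 1.0


def relevance(query_keyword, keyword):
    """Stand-in similarity: Jaccard overlap of the two words' characters."""
    a, b = set(query_keyword), set(keyword)
    return len(a & b) / len(a | b) if a | b else 0.0


def score(query_keyword, keywords, segments):
    """Assumed form of T: sum over first keywords of IDF weight times similarity."""
    return sum(idf(k, segments) * relevance(query_keyword, k) for k in keywords)


def top_segments(query_keyword, segments, limit=10):
    """Sort segments by score from high to low and keep the first `limit`."""
    ranked = sorted(
        segments,
        key=lambda name: score(query_keyword, segments[name], segments),
        reverse=True,
    )
    return ranked[:limit]


# Hypothetical segments with their first keywords.
segments = {
    "segment_a": ["rocket", "shenzhou"],
    "segment_b": ["rocket", "engine"],
    "segment_c": ["bank", "ledger"],
}
print(top_segments("rocket", segments, limit=2))
# prints ['segment_a', 'segment_b']
```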
By scoring the similarity among the text segments and rendering the top-scoring text segments for the user's query request in their own formats, the text segments that the user wants to view can be obtained automatically and quickly. The user does not need to spend time switching back and forth between types in the display interface or picking out the wanted text segments within each text, so the time for obtaining and rendering text is reduced, and text segments are used as search results with different types of text displayed together in the display interface.
Fig. 2 is a functional block diagram of the multimedia resource searching apparatus 100 according to the present invention.
The multimedia resource searching apparatus 100 of the present invention may be installed in an electronic device. According to the functions implemented, the multimedia resource searching apparatus 100 may include an extraction module 110, a storage module 120, a query module 130, and an output module 140. A module according to the present invention, which may also be referred to as a unit, refers to a series of computer program segments that are stored in a memory of the electronic device, can be executed by a processor of the electronic device, and perform a fixed function.
In this embodiment, the functions of the modules/units are as follows:
the extraction module 110: configured to extract text content from a plurality of different types of texts respectively to obtain one or more text segments, store the one or more text segments in a preset database, and perform word segmentation on each text segment to obtain a first keyword of each text segment;
the storage module 120: configured to construct an inverted index table for word search according to the first keyword, and store the classification label of each text segment in the inverted index table to construct a multimedia library;
the query module 130: configured to receive a query request sent by a user side, extract a second keyword from the query request, search the multimedia library, according to the inverted index table and the second keyword, for the classification label of the text segment corresponding to a first keyword associated with the second keyword, and read the corresponding text segment from the preset database according to the classification label obtained by the search;
the output module 140: configured to score the similarity among the text segments, sort the obtained score values according to a preset sorting order, select a preset number of text segments according to the sorting order, render the selected text segments into corresponding texts, and output the texts to the user side.
In one embodiment, the extracting text content from the texts of different types to obtain one or more text segments and storing the text segments in a preset database includes:
dividing each type of text into a format part and a text content part, and performing segment division on the text content part to obtain one or more text segments and storing the text segments in a preset database.
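For web page text, separating the format part from the text content part can be sketched with Python's standard `html.parser` module. This is only an illustration for one text type under simplifying assumptions: each run of character data between tags is treated as one text segment, and the class and function names are hypothetical. A production extractor would also need to skip script and style content, which this sketch does not do.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect character data and discard the tags (the format part)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:  # ignore whitespace-only runs between tags
            self.chunks.append(text)


def split_html(html):
    """Return the text content of an HTML string as a list of text segments."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


page = "<h1>Shenzhou</h1><p>The rocket launched.</p><p>   </p>"
print(split_html(page))
# prints ['Shenzhou', 'The rocket launched.']
```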
In an embodiment, the segmenting each text segment to obtain the first keyword of each text segment includes:
dividing the long text sentence of each text segment according to a preset word segmentation algorithm to obtain a plurality of word groups;
and calculating the similarity value between adjacent phrases, and taking the phrase with the similarity value smaller than a preset threshold value as a first keyword.
In one embodiment, after the constructing the inverted index table of the word search according to the first keyword, the method further comprises:
counting word frequency values of the first keywords appearing in the corresponding text segments;
comparing the word frequency value with a preset word frequency value, and if the word frequency value is greater than or equal to the preset word frequency value, filling the first keyword into a high-frequency word queue in the inverted index table;
and if the word frequency value is smaller than a preset word frequency value, filling the first keyword into a low-frequency word queue in the inverted index table.
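The queue assignment above reduces to a threshold comparison on word counts. A minimal sketch follows, assuming the text segment is already segmented into a list of words. The function name `fill_queues` and the threshold value are hypothetical.

```python
from collections import Counter


def fill_queues(segment_words, first_keywords, preset_frequency=2):
    """Split first keywords into high- and low-frequency queues by word count."""
    counts = Counter(segment_words)
    high_queue, low_queue = [], []
    for keyword in first_keywords:
        if counts[keyword] >= preset_frequency:
            high_queue.append(keyword)  # at or above the preset value
        else:
            low_queue.append(keyword)  # below the preset value
    return high_queue, low_queue


words = "rocket launch rocket shenzhou rocket launch".split()
high_queue, low_queue = fill_queues(words, ["rocket", "launch", "shenzhou"])
print(high_queue, low_queue)
# prints ['rocket', 'launch'] ['shenzhou']
```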
In one embodiment, before storing the classification tags of the text segments in the inverted index table to construct the multimedia library, the method further comprises:
reading a text sequence of a first keyword of each text segment, inputting the text sequence into a preset classification model for marking and embedding to obtain word vector characteristics;
matching the classification labels of the text segments from the label modules of the preset classification models according to the word vector characteristics, and establishing a mapping relation between the classification labels and the first keywords of the text segments.
In one embodiment, the extracting the second keyword from the query request includes:
performing word segmentation on the information of the query request to obtain a plurality of word segments;
and generating a dictionary tree according to a pre-constructed dictionary word list, and inputting the plurality of word segments into the dictionary tree for traversal to obtain the second keyword.
In one embodiment, the searching, according to the inverted index table and the second keyword, for a classification tag of a text segment corresponding to the first keyword associated with the second keyword in the multimedia library, and reading the corresponding text segment from the preset database according to the retrieved classification tag includes:
inputting the second keyword into a search engine of the inverted index table;
traversing the first keywords in the inverted index table according to the search engine to obtain first keywords related to the second keywords;
and reading the classification label of the associated first keyword according to the mapping relation, and reading the corresponding text segment from the preset database according to the retrieved classification label.
Fig. 3 is a schematic diagram of an electronic device 1 according to a preferred embodiment of the invention.
The electronic device 1 includes, but is not limited to, a memory 11, a processor 12, a display 13, and a network interface 14. The electronic device 1 is connected to a network through the network interface 14 to obtain raw data. The network may be a wireless or wired network such as an intranet, the Internet, a Global System for Mobile Communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, Wi-Fi, or a telephone network.
The memory 11 includes at least one type of readable medium, including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a random access memory (RAM), a static random access memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 11 may be an internal storage unit of the electronic device 1, such as a hard disk or a memory of the electronic device 1. In other embodiments, the memory 11 may be an external storage device of the electronic device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the electronic device 1. Of course, the memory 11 may also comprise both an internal storage unit and an external storage device of the electronic device 1. In this embodiment, the memory 11 is generally used for storing the operating system and the various application software installed in the electronic device 1, such as the program code of the multimedia resource search program 10. Further, the memory 11 may also be used to temporarily store various types of data that have been output or are to be output.
In some embodiments, the processor 12 may be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 12 is typically used for controlling the overall operation of the electronic device 1, such as performing control and processing related to data interaction or communication. In this embodiment, the processor 12 is configured to run the program code stored in the memory 11 or to process data, for example, to run the program code of the multimedia resource search program 10.
The display 13 may be referred to as a display screen or display unit. In some embodiments, the display 13 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an organic light-emitting diode (OLED) touch panel, or the like. The display 13 is used for displaying information processed in the electronic device 1 and for displaying a visual work interface, e.g. displaying the results of data statistics.
The network interface 14 may optionally comprise a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The network interface 14 is typically used for establishing a communication connection between the electronic device 1 and other electronic devices.
Fig. 3 only shows the electronic device 1 with the components 11-14 and the multimedia resource search program 10, but it is understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead.
Optionally, the electronic device 1 may further comprise a user interface, the user interface may comprise a Display (Display), an input unit such as a Keyboard (Keyboard), and the optional user interface may further comprise a standard wired interface and a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an organic light-emitting diode (OLED) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the electronic device 1 and for displaying a visualized user interface, among other things.
The electronic device 1 may further comprise Radio Frequency (RF) circuitry, sensors, audio circuitry, etc., which will not be described in detail herein.
In the above embodiment, the processor 12 may implement the following steps when executing the multimedia resource search program 10 stored in the memory 11:
respectively extracting character contents from various different types of texts to obtain one or more text segments, storing the text segments in a preset database, and segmenting each text segment to obtain a first keyword of each text segment;
constructing an inverted index table of word search according to the first keyword, and storing the classification label of each text segment to the inverted index table to construct a multimedia library;
receiving a query request sent by a user side, extracting a second keyword from the query request, searching a classification label of a text segment corresponding to a first keyword associated with the second keyword in the multimedia library according to the inverted index table and the second keyword, and reading a corresponding text segment from the preset database according to the classification label obtained by retrieval;
and scoring the similarity among the text fragments, sorting the obtained scoring values according to a preset sorting sequence, selecting a preset number of text fragments according to the sorting sequence, rendering the text fragments into corresponding texts, and outputting the texts to the user side.
The storage device may be the memory 11 of the electronic device 1, or may be another storage device communicatively connected to the electronic device 1.
For the detailed description of the above steps, please refer to the above description of fig. 2 regarding a functional block diagram of an embodiment of the multimedia resource searching apparatus 100 and fig. 1 regarding a flowchart of an embodiment of a multimedia resource searching method.
In addition, an embodiment of the present invention further provides a computer-readable medium, which may be non-volatile or volatile. The computer-readable medium may be any one or any combination of hard disks, multimedia cards, SD cards, flash memory cards, SMCs, read-only memory (ROM), erasable programmable read-only memory (EPROM), portable compact disc read-only memory (CD-ROM), USB memory, and the like. The computer-readable medium includes a data storage area and a program storage area. The data storage area stores data created according to the use of the blockchain node, the program storage area stores a multimedia resource search program 10, and the multimedia resource search program 10, when executed by a processor, implements the following operations:
respectively extracting character contents from various different types of texts to obtain one or more text segments, storing the text segments in a preset database, and segmenting each text segment to obtain a first keyword of each text segment;
constructing an inverted index table of word search according to the first keyword, and storing the classification label of each text segment to the inverted index table to construct a multimedia library;
receiving a query request sent by a user side, extracting a second keyword from the query request, searching a classification label of a text segment corresponding to a first keyword associated with the second keyword in the multimedia library according to the inverted index table and the second keyword, and reading a corresponding text segment from the preset database according to the classification label obtained by retrieval;
and scoring the similarity among the text fragments, sorting the obtained scoring values according to a preset sorting sequence, selecting a preset number of text fragments according to the sorting sequence, rendering the text fragments into corresponding texts, and outputting the texts to the user side.
The specific implementation of the computer readable medium of the present invention is substantially the same as the specific implementation of the multimedia resource searching method, and is not described herein again.
In another embodiment, to further ensure the privacy and security of all the data, all the data, such as the first keyword and the second keyword, may be stored in a node of a blockchain.
It should be noted that the blockchain in the present invention is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, in which each data block contains information on a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
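The idea of data blocks linked by cryptographic methods can be illustrated with a minimal hash chain. This sketch shows only the linking and validity check. It has no network, consensus mechanism or encryption, and it is not the blockchain platform of the embodiment. The block layout and function names are assumptions.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first block


def block_hash(index, data, previous_hash):
    """SHA-256 over a canonical JSON encoding of the block contents."""
    payload = json.dumps(
        {"index": index, "data": data, "previous_hash": previous_hash},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_block(chain, data):
    """Append a block whose hash covers the hash of the previous block."""
    previous_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    index = len(chain)
    chain.append({
        "index": index,
        "data": data,
        "previous_hash": previous_hash,
        "hash": block_hash(index, data, previous_hash),
    })


def is_valid(chain):
    """Check every link and every stored hash; tampering breaks the check."""
    previous_hash = GENESIS_HASH
    for index, block in enumerate(chain):
        if block["index"] != index or block["previous_hash"] != previous_hash:
            return False
        if block["hash"] != block_hash(index, block["data"], previous_hash):
            return False
        previous_hash = block["hash"]
    return True


chain = []
append_block(chain, {"first_keyword": "rocket"})
append_block(chain, {"second_keyword": "spaceship"})
print(is_valid(chain))
```

Changing the data of an earlier block without recomputing every later hash makes `is_valid` return `False`, which is the anti-counterfeiting property referred to above.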
It should be noted that the above-mentioned numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments. The terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of another identical element in the process, apparatus, article, or method comprising the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and can certainly also be implemented by hardware, although in many cases the former is the better implementation. Based on such understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product, which is stored in a medium as described above (such as a ROM/RAM, magnetic disk, or optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, an electronic device, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A method for searching multimedia resources, the method comprising:
respectively extracting character contents from various different types of texts to obtain one or more text segments, storing the text segments in a preset database, and segmenting each text segment to obtain a first keyword of each text segment;
constructing an inverted index table of word search according to the first keyword, and storing the classification label of each text segment to the inverted index table to construct a multimedia library;
receiving a query request sent by a user side, extracting a second keyword from the query request, searching a classification label of a text segment corresponding to a first keyword associated with the second keyword in the multimedia library according to the inverted index table and the second keyword, and reading a corresponding text segment from the preset database according to the classification label obtained by retrieval;
and scoring the similarity among the text fragments, sorting the obtained scoring values according to a preset sorting sequence, selecting a preset number of text fragments according to the sorting sequence, rendering the text fragments into corresponding texts, and outputting the texts to the user side.
2. The method for searching multimedia resources according to claim 1, wherein the plurality of different types of texts include web page texts, PDF texts, picture texts, and video texts, and the extracting text contents from the plurality of different types of texts respectively to obtain one or more text segments and storing the one or more text segments in a preset database includes:
dividing each type of text into a format part and a text content part, and performing segment division on the text content part to obtain one or more text segments and storing the text segments in a preset database.
3. The method for searching for multimedia resources according to claim 1, wherein the segmenting words for each text segment to obtain the first keyword of each text segment comprises:
dividing the long text sentence of each text segment according to a preset word segmentation algorithm to obtain a plurality of word groups;
and calculating similarity values between adjacent phrases, and taking the phrases with the similarity values smaller than a preset threshold value as first keywords.
4. The method for searching for multimedia resources according to claim 1, wherein after said constructing an inverted index table of word searches based on said first keyword, the method further comprises:
counting word frequency values of the first keywords appearing in the corresponding text segments;
comparing the word frequency value with a preset word frequency value, and if the word frequency value is greater than or equal to the preset word frequency value, filling the first keyword into a high-frequency word queue in the inverted index table;
and if the word frequency value is smaller than a preset word frequency value, filling the first keyword into a low-frequency word queue in the inverted index table.
5. The method of claim 1, wherein before storing the classification tags of the text segments in the inverted index table to construct a multimedia library, the method further comprises:
reading a text sequence of a first keyword of each text segment, inputting the text sequence into a preset classification model for marking and embedding to obtain word vector characteristics;
matching the classification labels of the text segments from the label modules of the preset classification models according to the word vector characteristics, and establishing a mapping relation between the classification labels and the first keywords of the text segments.
6. The method for searching for multimedia resources according to claim 1, wherein said extracting the second keyword from the query request comprises:
performing word segmentation on the information of the query request to obtain a plurality of word segments;
and generating a dictionary tree according to a pre-constructed dictionary word list, and inputting the plurality of word segments into the dictionary tree for traversal to obtain the second keyword.
7. The method for searching for multimedia resources according to claim 1, wherein said searching for the category label of the text segment corresponding to the first keyword associated with the second keyword in the multimedia library according to the inverted index table and the second keyword, and reading the corresponding text segment from the preset database according to the retrieved category label comprises:
inputting the second keyword into a search engine of the inverted index table;
traversing the first keywords in the inverted index table according to the search engine to obtain first keywords associated with the second keywords;
and reading the classification label of the associated first keyword according to the mapping relation, and reading the corresponding text segment from the preset database according to the retrieved classification label.
8. An apparatus for searching multimedia resources, the apparatus comprising:
an extraction module, configured to extract text content from multiple different types of texts to obtain one or more text segments, store the one or more text segments in a preset database, and perform word segmentation on each text segment to obtain a first keyword of each text segment;
a storage module, configured to build an inverted index table for word search according to the first keywords, and store the classification labels of the text segments in the inverted index table to build a multimedia library;
a query module, configured to receive a query request sent by a user side, extract a second keyword from the query request, search the multimedia library, according to the inverted index table and the second keyword, for the classification label of the text segment corresponding to the first keyword associated with the second keyword, and read the corresponding text segment from the preset database according to the retrieved classification label; and
an output module, configured to score the similarity among the text segments, sort the resulting scores in a preset order, select a preset number of text segments according to that order, render the selected text segments into corresponding texts, and output the texts to the user side.
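The scoring, sorting, and selection performed by the output module can be sketched as follows. The claim does not name a similarity measure here, so this sketch assumes Jaccard similarity between each segment's whitespace-separated tokens and the query keywords; the function names `jaccard` and `top_segments` and the sample data are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two token collections (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def top_segments(segments, query_keywords, limit, descending=True):
    """Score each segment, sort by score in the preset order, keep the first `limit`."""
    scored = [
        # Assumption: tokens are lowercase whitespace-separated words.
        (jaccard(text.lower().split(), query_keywords), text)
        for text in segments
    ]
    scored.sort(key=lambda pair: pair[0], reverse=descending)
    return [text for _, text in scored[:limit]]


# Hypothetical retrieved segments and query keywords.
segments = ["inverted index search", "word segmentation of text", "index search"]
selected = top_segments(segments, ["index", "search"], limit=2)
```

The `descending` flag stands in for the preset sorting order and `limit` for the preset number of text segments; rendering the selected segments back into their original text type is left out of the sketch.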
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a program executable by the at least one processor, the program being executed by the at least one processor to enable the at least one processor to perform the multimedia resource searching method according to any one of claims 1 to 7.
10. A computer-readable medium, characterized in that the computer-readable medium stores a multimedia resource searching program, and the multimedia resource searching program, when executed by a processor, implements the multimedia resource searching method according to any one of claims 1 to 7.
CN202210855628.4A 2022-07-20 2022-07-20 Multimedia resource searching method, device, equipment and medium Pending CN115203445A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210855628.4A CN115203445A (en) 2022-07-20 2022-07-20 Multimedia resource searching method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN115203445A true CN115203445A (en) 2022-10-18

Family

ID=83582030

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210855628.4A Pending CN115203445A (en) 2022-07-20 2022-07-20 Multimedia resource searching method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN115203445A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116578666A (en) * 2023-07-12 2023-08-11 拓尔思信息技术股份有限公司 Segment sentence position inverted index structure design and limited operation full text retrieval method thereof
CN116578666B (en) * 2023-07-12 2023-09-22 拓尔思信息技术股份有限公司 Segment sentence position inverted index structure design and limited operation full text retrieval method thereof

Similar Documents

Publication Publication Date Title
CN109145153B (en) Intention category identification method and device
US11514235B2 (en) Information extraction from open-ended schema-less tables
CN108932294B (en) Resume data processing method, device, equipment and storage medium based on index
US7386438B1 (en) Identifying language attributes through probabilistic analysis
US8577882B2 (en) Method and system for searching multilingual documents
TWI536181B (en) Language identification in multilingual text
US7523102B2 (en) Content search in complex language, such as Japanese
US11580181B1 (en) Query modification based on non-textual resource context
US10423649B2 (en) Natural question generation from query data using natural language processing system
CN111881307A (en) Demonstration manuscript generation method and device, computer equipment and storage medium
US10366154B2 (en) Information processing device, information processing method, and computer program product
WO2020056977A1 (en) Knowledge point pushing method and device, and computer readable storage medium
CN112395420A (en) Video content retrieval method and device, computer equipment and storage medium
CN108121715B (en) Character labeling method and character labeling device
CN106980664B (en) Bilingual comparable corpus mining method and device
CN112818093A (en) Evidence document retrieval method, system and storage medium based on semantic matching
CN111522901A (en) Method and device for processing address information in text
CN112818200A (en) Data crawling and event analyzing method and system based on static website
US11520835B2 (en) Learning system, learning method, and program
CN115203445A (en) Multimedia resource searching method, device, equipment and medium
CN105808615A (en) Document index generation method and device based on word segment weights
CN114297143A (en) File searching method, file displaying device and mobile terminal
CN112015907A (en) Method and device for quickly constructing discipline knowledge graph and storage medium
CN110705285A (en) Government affair text subject word bank construction method, device, server and readable storage medium
CN110489528B (en) Electronic dictionary reconstruction method based on electronic book content and computing equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination