CN116910054A - Data processing method, device, electronic equipment and computer readable storage medium


Info

Publication number: CN116910054A
Application number: CN202310559630.1A
Authority: CN (China)
Prior art keywords: scene, scene data, data, library, sub
Legal status: Pending
Other languages: Chinese (zh)
Inventors: Fu Jiaojiao (付姣姣), Meng Fanyu (孟繁宇), Wang Huixin (王惠欣), Chao Fangning (钞芳宁)
Assignee (current and original): China Mobile Communications Group Co Ltd; China Mobile Communications Ltd Research Institute
Application filed by China Mobile Communications Group Co Ltd and China Mobile Communications Ltd Research Institute


Classifications

    • G06F 16/22: Indexing; data structures therefor; storage structures
    • G06F 16/284: Relational databases
    • G06F 16/285: Clustering or classification
    • G06F 40/279: Recognition of textual entities
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a data processing method, a data processing device, electronic equipment and a computer readable storage medium, relates to the technical field of the Internet, and aims to solve the problem that existing data matching methods have low matching accuracy. The method comprises the following steps: acquiring a description sentence input by a user; determining, from a pre-established scene database, a target sub-library corresponding to the target scene type to which the description sentence belongs; searching the target sub-library for M pieces of first scene data matched with the description sentence; calculating, for each piece of first scene data, the text similarity between the first scene data and the description sentence; and displaying the first N pieces of first scene data, among the M pieces, with the greatest text similarity to the description sentence. According to the embodiment of the application, the matching sub-library is locked through the scene data type, and the matched scene data is determined by a dual search mechanism, so that the matching accuracy can be improved.

Description

Data processing method, device, electronic equipment and computer readable storage medium
Technical Field
The present application relates to the field of internet technologies, and in particular, to a data processing method, a data processing device, an electronic device, and a computer readable storage medium.
Background
Data matching takes a descriptive text input by a user and returns data with high similarity to that input. In the prior art, the data are usually indexed, and the description sentence input by the user is fed to a search engine to implement retrieval; however, the matching accuracy of such retrieval is low.
Disclosure of Invention
The embodiment of the application provides a data processing method, a data processing device, electronic equipment and a computer readable storage medium, which are used for solving the problem of low matching accuracy of the existing data matching method.
In a first aspect, an embodiment of the present application provides a data processing method, including:
acquiring a description sentence input by a user;
determining a target sub-library corresponding to the target scene type to which the description statement belongs from a pre-established scene database;
searching M pieces of first scene data matched with the description statement from the target sub-library, wherein M is an integer greater than 1;
calculating the text similarity of the first scene data and the description statement aiming at each piece of the first scene data;
and displaying the first N pieces of first scene data with the maximum text similarity with the descriptive statement in the M pieces of first scene data, wherein N is a positive integer smaller than M.
Optionally, the searching M pieces of first scene data matched with the description statement from the target sub-library includes:
matching the description sentences with index information of each piece of scene data in the target sub-library, and determining matching scores of each piece of scene data in the target sub-library and the description sentences;
and selecting the first M pieces of first scene data with the largest matching score with the description sentences in the target sub-library.
Optionally, the calculating the text similarity between the first scene data and the description sentence includes:
extracting features of the first scene data to obtain a first sentence characterization vector;
extracting features of the description sentences to obtain second sentence characterization vectors;
pooling the first sentence characterization vector to obtain a first sentence vector;
pooling the second sentence characterization vector to obtain a second sentence vector;
and calculating cosine similarity of the first sentence vector and the second sentence vector as text similarity of the first scene data and the descriptive sentence.
Optionally, before the acquiring the description sentence input by the user, the method further includes:
acquiring scene data to be stored;
and storing the scene data to be stored into a sub-library of scene categories to which the scene data to be stored belongs.
Optionally, the storing the to-be-stored scene data in a sub-library of scene categories to which the to-be-stored scene data belongs includes:
under the condition that each piece of scene data in the scene data to be stored belongs to single scene data, storing the scene data to be stored into a sub-library of a first scene category according to the first scene category to which the scene data to be stored belongs;
and under the condition that each piece of scene data of the scene data to be stored belongs to multi-scene data, classifying the scene data to be stored to obtain P types of scene data, and respectively storing the P types of scene data into sub-libraries corresponding to P scene categories, wherein the sub-library of each scene category stores the scene data corresponding to the scene category, and P is an integer larger than 1.
Optionally, the storing the to-be-stored scene data to a scene category to which the to-be-stored scene data belongs includes:
segmenting a second scene data to generate an index file of the second scene data, wherein the index file is recorded with identification of the second scene data, words and word frequency information in the second scene data, and the second scene data is any piece of scene data in the scene data to be stored;
and storing the second scene data and the index file in a sub-library of scene categories to which the second scene data belongs.
Optionally, each sub-library in the scene database is provided with a scene index; the determining, from a pre-established scene database, a target sub-library corresponding to a target scene type to which the description sentence belongs, includes:
and matching the target scene type with scene indexes of all sub-libraries in the scene database, and determining the sub-library with successfully matched scene indexes with the target scene type as the target sub-library.
In a second aspect, an embodiment of the present application further provides a data processing apparatus, including:
the first acquisition module is used for acquiring the descriptive statement input by the user;
the determining module is used for determining a target sub-library corresponding to the target scene type to which the description statement belongs from a pre-established scene database;
the searching module is used for searching M pieces of first scene data matched with the description statement from the target sub-library, wherein M is an integer greater than 1;
the computing module is used for computing the text similarity between the first scene data and the description sentences for each piece of the first scene data;
and the display module is used for displaying the first N pieces of first scene data with the maximum text similarity with the descriptive statement in the M pieces of first scene data, wherein N is a positive integer smaller than M.
Optionally, the search module includes:
the determining unit is used for matching the description sentence with index information of each piece of scene data in the target sub-library and determining a matching score of each piece of scene data in the target sub-library and the description sentence;
and the selecting unit is used for selecting the first M pieces of first scene data with the largest matching score with the description sentence in the target sub-library.
Optionally, the computing module includes:
the first feature extraction unit is used for carrying out feature extraction on the first scene data to obtain a first sentence characterization vector;
the second feature extraction unit is used for carrying out feature extraction on the description sentence to obtain a second sentence characterization vector;
the first pooling processing unit is used for pooling the first sentence characterization vector to obtain a first sentence vector;
the second pooling processing unit is used for pooling the second sentence characterization vector to obtain a second sentence vector;
and the calculating unit is used for calculating cosine similarity of the first sentence vector and the second sentence vector and taking the cosine similarity as text similarity of the first scene data and the descriptive sentence.
Optionally, the data processing apparatus further includes:
the second acquisition module is used for acquiring scene data to be stored;
and the storage module is used for storing the scene data to be stored into a sub-library of scene categories to which the scene data to be stored belong.
Optionally, the storage module is configured to store the to-be-stored scene data into a sub-library of a first scene category according to the first scene category to which the to-be-stored scene data belongs, when it is determined that each piece of the to-be-stored scene data belongs to a single scene data.
Optionally, the data processing apparatus further includes:
the classification module is used for classifying the scene data to be stored to obtain P-type scene data under the condition that each piece of scene data in the scene data to be stored is determined to belong to multi-scene data;
the storage module is used for respectively storing the P-type scene data into sub-libraries corresponding to P scene categories, wherein the sub-library of each scene category stores the scene data corresponding to the scene category, and P is an integer greater than 1.
Optionally, the storage module includes:
the generating unit is used for segmenting second scene data to generate an index file of the second scene data, wherein the index file is recorded with identification of the second scene data, words and word frequency information in the second scene data, and the second scene data is any piece of scene data in the scene data to be stored;
and the storage unit is used for storing the second scene data and the index file in a sub-library of the scene category to which the second scene data belongs.
Optionally, each sub-library in the scene database is provided with a scene index; the determining module is used for matching the target scene type with the scene indexes of all the sub-libraries in the scene database, and determining the sub-library with the scene index successfully matched with the target scene type as the target sub-library.
In a third aspect, an embodiment of the present application further provides an electronic device, including: a memory, a processor and a computer program stored on the memory and executable on the processor, which processor implements the steps in the data processing method as described above when executing the computer program.
In a fourth aspect, embodiments of the present application also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of a data processing method as described above.
In the embodiment of the application, a description sentence input by a user is acquired; a target sub-library corresponding to the target scene type to which the description sentence belongs is determined from a pre-established scene database; M pieces of first scene data matched with the description sentence are searched from the target sub-library, wherein M is an integer greater than 1; for each piece of the first scene data, the text similarity between the first scene data and the description sentence is calculated; and the first N pieces of first scene data with the maximum text similarity to the description sentence among the M pieces of first scene data are displayed, wherein N is a positive integer smaller than M. In this way, the matching sub-library is locked through the scene data type, and the matched scene data is determined by a dual search mechanism combined with text similarity calculation, so that the relation between sentences can be captured well and the matching accuracy is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments of the present application will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort to a person of ordinary skill in the art.
FIG. 1 is one of the flowcharts of a data processing method provided by an embodiment of the present application;
FIG. 2 is a schematic structural diagram based on a Sentence-Bert model according to an embodiment of the present application;
FIG. 3 is a second flowchart of a data processing method according to an embodiment of the present application;
FIG. 4 is a block diagram of a data processing apparatus according to an embodiment of the present application;
fig. 5 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
In order to make the embodiments of the present application more clear, the following description will be given to the related technical knowledge related to the embodiments of the present application:
the essence of scene data matching is that a user inputs a description text and scene data with high similarity to that text is returned. The prior art mainly adopts methods based on relational and non-relational databases, deep learning methods, and distributed search engines.
In methods based on relational and non-relational databases, the indexes built in a relational database are generally only suitable for processing structured data. A non-relational database can process unstructured data, but long-text matching is difficult to realize, so the approach is not suitable for massive text data retrieval and suffers from problems such as slow retrieval speed and the inability to obtain the content a user needs from a given keyword.
The deep learning method obtains text matching results by using a deep learning text similarity model; if the model has too many parameters, prediction and inference are slow, causing a large computation cost.
The distributed search engine increases the data storage capacity by storing the data in each node of the distributed system, but the search is essentially matching of 'words', so that semantic search cannot be realized, and the matching accuracy is low.
Aiming at the technical problems, the application provides a hybrid scene matching method based on a dual search mechanism. The method is based on a dual search mechanism, can improve the matching accuracy, and is compatible with single-scene and multi-scene matching targets.
The data processing method provided by the embodiment of the application is described in detail below through specific embodiments and application scenes thereof with reference to the accompanying drawings.
Referring to fig. 1, fig. 1 is a flowchart of a data processing method according to an embodiment of the present application, as shown in fig. 1, including the following steps:
and step 101, acquiring a description sentence input by a user.
The description sentence may be descriptive text input by a user for retrieving related information. Specifically, the user may input the description sentence on a retrieval interface, where the retrieval interface may be a web search interface or any other interface through which related information is retrieved. For example, when the user wants to find knowledge about cauliflower, the user may input description sentences such as "cauliflower" or "what kinds of cauliflower are there" in the search interface; when the user wants to know how to use the app split-screen feature, a description sentence such as "how to use the app split-screen feature" may be input in the search interface.
After the description sentences input by the user are acquired, the description sentences can be used as query sentences to search the data matched with the description sentences in the database, and search results are obtained and returned to the user.
Optionally, before the step 101, the method further includes:
acquiring scene data to be stored;
and storing the scene data to be stored into a sub-library of scene categories to which the scene data to be stored belongs.
In one embodiment, the obtained scene data may be stored in the corresponding sub-databases according to the scene types, so as to establish a scene database.
The scene data to be stored may be obtained by a web crawler program that crawls specified webpage data and performs deduplication processing according to certain rules, where the scene data to be stored may include a plurality of pieces of scene data. Specifically, a crawled web page uniform resource locator (Uniform Resource Locator, URL) may be set first; a hypertext transfer protocol (Hyper Text Transfer Protocol, HTTP) request is then sent in a simulated manner to crawl the URL and obtain the hypertext markup language (Hyper Text Markup Language, HTML) source code of the web page; script-removal and style-removal preprocessing operations are performed on the web page source code; the Jsoup library is used to parse the preprocessed web page into a DOM (Document Object Model) tree; and the weight of each node is calculated, the score of each node being given by the following function:
score = log(SD) × ND_i × log10(PNum_i + 2) × log(SbD_i)

where SD denotes the standard deviation of the text density over all nodes, ND_i denotes the text density of node i, PNum_i denotes the number of <p> tags of node i, and SbD_i denotes the punctuation density of node i.
The node with the largest score computed by the above formula is the main node containing the body text. After the node element of the text is located, all text content can be extracted by using Jsoup, and the content to be extracted for a specific scene can be set according to actual requirements.
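As a minimal illustration of this scoring step, the following Python sketch computes the above score for a set of nodes whose statistics have already been extracted; the node statistics and names are hypothetical, and the embodiment itself uses the Java Jsoup library rather than Python:

```python
import math

def node_score(sd: float, nd_i: float, pnum_i: int, sbd_i: float) -> float:
    """score = log(SD) * ND_i * log10(PNum_i + 2) * log(SbD_i)."""
    return math.log(sd) * nd_i * math.log10(pnum_i + 2) * math.log(sbd_i)

# Hypothetical per-node statistics: node -> (ND_i, PNum_i, SbD_i).
# SbD_i is taken here as punctuation marks per 100 characters (an assumption).
nodes = {
    "div#content": (8.4, 12, 6.0),
    "div#sidebar": (1.2, 1, 1.1),
    "div#footer":  (0.5, 0, 1.0),
}

densities = [nd for nd, _, _ in nodes.values()]
mean = sum(densities) / len(densities)
sd = math.sqrt(sum((d - mean) ** 2 for d in densities) / len(densities))

# The node with the maximum score is taken as the node containing the body text.
best = max(nodes, key=lambda name: node_score(sd, *nodes[name]))
print(best)  # -> "div#content"
```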
After the scene data to be stored is extracted, it is analyzed and the scene category to which it belongs is judged; whether a sub-library of that scene category already exists in the scene database is then checked. If it exists, the scene data to be stored is directly stored into the sub-library of the scene category to which it belongs; if not, a sub-library of the scene category is established and a scene index is set, and the scene data to be stored is then stored into that sub-library.
In this way, the obtained scene data to be stored is stored into the corresponding sub-library according to the scene category, so that the matched sub-library can be conveniently determined according to the scene category index during subsequent retrieval.
Optionally, the storing the to-be-stored scene data in a sub-library of scene categories to which the to-be-stored scene data belongs includes:
under the condition that each piece of scene data in the scene data to be stored belongs to single scene data, storing the scene data to be stored into a sub-library of a first scene category according to the first scene category to which the scene data to be stored belongs;
and under the condition that each piece of scene data in the scene data to be stored belongs to multi-scene data, classifying the scene data to be stored to obtain P types of scene data, and respectively storing the P types of scene data into sub-libraries corresponding to P scene categories, wherein the sub-library of each scene category stores the scene data corresponding to the scene category, and P is an integer larger than 1.
In a specific embodiment, the acquired scene data to be stored may be analyzed to determine whether it belongs to a single scene or a mixture of multiple scenes. Specifically, the scene category of each piece of scene data in the scene data to be stored may be identified, and it may then be determined whether the pieces all belong to the same category or to a plurality of different categories; if all pieces belong to the same category, the scene data to be stored belongs to a single scene, and if they belong to a plurality of different categories, the scene data to be stored belongs to a multi-scene mixture.
If it is determined that every piece of the scene data to be stored belongs to single scene data of one category, for example the scene data to be stored only includes news scene data, the scene data to be stored may be directly stored into the sub-library corresponding to that scene category, where each stored piece of scene data mainly includes a scene data title, a scene data source (such as the URL of the source web page), and the scene data content.
If it is determined that the pieces of scene data in the scene data to be stored belong to multi-scene data, for example the scene data to be stored contains scene data of multiple categories such as news, business, sports, and cultural entertainment, the scene data to be stored needs to be classified to obtain scene data of a plurality of different categories. Specifically, a corpus can be built from the multi-scene data and a Fasttext model can be used to classify it. The classified P categories of scene data are then respectively stored into the sub-libraries corresponding to those P scene categories, i.e., scene data of a given scene category is stored into the sub-library of the same scene category: for example, the classified news scene data is stored into the news sub-library, the business scene data into the business sub-library, the sports scene data into the sports sub-library, and the cultural entertainment scene data into the cultural entertainment sub-library. Each piece of scene data after classified storage mainly comprises its scene data title, scene data source, and scene data content.
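As a sketch of this classification step, assuming the open-source fasttext Python package and a hypothetical training file in its `__label__<category> <text>` line format (the file name, category names, and hyperparameters are illustrative, not taken from the embodiment):

```python
import fasttext

# Hypothetical corpus: one line per sample, e.g.
#   __label__news <segmented news text>
#   __label__sports <segmented sports text>
model = fasttext.train_supervised(input="scene_corpus.txt", epoch=25, wordNgrams=2)

def scene_category(text: str) -> str:
    """Predict the scene category of one piece of scene data."""
    labels, _probs = model.predict(text.replace("\n", " "))
    return labels[0].removeprefix("__label__")

# Route each piece of to-be-stored scene data into its category bucket.
buckets: dict[str, list[str]] = {}
for doc in ["...news text...", "...sports text..."]:
    buckets.setdefault(scene_category(doc), []).append(doc)
```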
In some embodiments, an Elasticsearch-based distributed search engine may be employed to store the scene data to be stored. Elasticsearch provides a distributed, multi-tenant full-text search engine built on Apache Lucene(TM); it supports distributed real-time file storage and real-time, efficient full-text search, and can process structured and unstructured data at the petabyte (PB) level. Different indexes can be set for different types of data, and the multiple pieces of scene data stored under the same index are distributed to different shards according to a routing algorithm.
For example, a news index, a business index, a sports index, a cultural entertainment index, an information technology index, and the like can be set, where each piece of scene data information under a scene category contains fields such as the scene data title, scene data source, and scene data content; the different acquired scene data are stored under the corresponding scene indexes according to the Elasticsearch storage structure, which conveniently provides a screening range for the specific description sentence uploaded by the user.
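A sketch of this storage layout, assuming the official elasticsearch-py client against a local cluster; the index names and field names mirror the title/source/content fields above but are otherwise illustrative:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# One index (sub-library) per scene category.
for scene_index in ["news", "business", "sports", "entertainment", "infotech"]:
    if not es.indices.exists(index=scene_index):
        es.indices.create(index=scene_index)

# One piece of scene data: title, source URL, and content.
es.index(
    index="news",  # sub-library of the scene category the data belongs to
    document={
        "title": "...scene data title...",
        "source": "https://example.com/article",
        "content": "...scene data content...",
    },
)
```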
In this way, in this embodiment, by classifying the scene data to be stored according to the categories and storing the scene data in the sub-libraries corresponding to the scene indexes, not only a large amount of scene data can be orderly classified and stored, but also the subsequent retrieval and locking of the screening range are facilitated, and the matching accuracy is ensured.
Optionally, the storing the to-be-stored scene data to a scene category to which the to-be-stored scene data belongs includes:
segmenting a second scene data to generate an index file of the second scene data, wherein the index file is recorded with identification of the second scene data, words and word frequency information in the second scene data, and the second scene data is any piece of scene data in the scene data to be stored;
and storing the second scene data and the index file in a sub-library of scene categories to which the second scene data belongs.
In one embodiment, when each piece of scene data in the scene data to be stored is stored, two pieces of content may be stored, one piece being the original content of the piece of scene data for returning the matched scene data to the user, and one being an index file generated by word segmentation for providing index information when the matched scene data is retrieved.
Specifically, when a document, i.e., each piece of scene data, is written to Elasticsearch, Elasticsearch by default holds two pieces of content: the original content of the document, and an inverted index file generated by word segmentation. First, the IK Chinese tokenizer, which is based on a dictionary and rules, is adopted to segment the content fields in the document, and the number of occurrences of each word in the document is recorded; an inverted index is then generated and stored in the index library. The inverted index list corresponding to each word records not only the document number, i.e., the scene data identification, but also word frequency information, i.e., how often the word occurs in a given document. When scene data matched with the description sentence input by the user is searched according to the index information, the similarity can be calculated from the word frequency information of each word in the matched scene data; for example, if a word in a certain piece of scene data matches the description sentence, the higher its word frequency, the higher the matching degree or matching score.
Thus, in this embodiment, by generating the index file for each piece of scene data, it is possible to facilitate quick retrieval and calculation of the similarity from the index file when matching scene data is retrieved later.
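The per-field inverted-index behaviour can be configured in the index mapping; a sketch assuming the elasticsearch-py client and the IK analysis plugin installed on the cluster (the `ik_max_word` analyzer name comes from that plugin; the index and field names are illustrative):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# The title and content fields are segmented by the IK Chinese analyzer;
# Elasticsearch keeps the original _source of each document plus an inverted
# index that records, per term, the document IDs and term frequencies later
# used for scoring.
es.indices.create(
    index="news",
    mappings={
        "properties": {
            "title":   {"type": "text", "analyzer": "ik_max_word"},
            "source":  {"type": "keyword"},
            "content": {"type": "text", "analyzer": "ik_max_word"},
        }
    },
)
```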
Step 102, determining a target sub-library corresponding to the target scene type to which the description statement belongs from a pre-established scene database.
After the description sentence is acquired, the target scene type to which the description sentence belongs can be determined first, for example, the scene keyword in the description sentence is identified, and the corresponding scene type is determined. And then determining a corresponding target sub-library according to the target scene type, namely determining the sub-library of the target scene type, wherein the target sub-library stores scene data of the target scene type.
In the embodiment of the application, in order to ensure that the database corresponding to the scene type can be queried according to the scene type, so as to reduce the query range and ensure the matching precision, the scene database formed by a plurality of sub-databases of different scene types can be pre-established, and the scene data of different scene types can be respectively stored in the different sub-databases. Specifically, a distributed search engine may be used to store various scene data, and the obtained various scene data are classified according to scene types and then stored in different sub-libraries, and scene type tags or index identifiers are allocated to the sub-libraries.
Optionally, each sub-library in the scene database is provided with a scene index; the step 102 includes:
and matching the target scene type with scene indexes of all sub-libraries in the scene database, and determining the sub-library with the scene index matched with the target scene type as the target sub-library.
That is, in one embodiment, when the various types of scene data are stored, a scene index may be set for each scene-category sub-library and used to retrieve the corresponding sub-library according to the scene type. For example, a scene index of the corresponding type is generated according to the type of scene data stored in each sub-library, so that the scene index of each sub-library is consistent with the scene data it stores: if sub-library 1 stores news scene data, a news scene index is generated and set for sub-library 1; if sub-library 2 stores information technology scene data, an information technology scene index is generated and set for sub-library 2; if sub-library 3 stores cultural entertainment scene data, a cultural entertainment scene index is generated and set for sub-library 3; and so on.
In this way, after the description sentence input by the user is obtained, the target scene type to which the description sentence belongs can be respectively matched with the scene indexes of all the sub-libraries in the scene database, and the sub-library successfully matched is the target sub-library of the target scene type.
In this way, by means of scene index matching, it can be ensured that the target sub-library where the scene data matched with the current description statement is located can be determined rapidly and accurately.
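Locking the target sub-library then amounts to matching the target scene type against the scene index set on each sub-library; a trivial sketch with illustrative names:

```python
# Scene index set on each sub-library when it was created (illustrative).
SCENE_INDEX_OF_SUBLIBRARY = {
    "sublibrary_1": "news",
    "sublibrary_2": "infotech",
    "sublibrary_3": "entertainment",
}

def target_sublibrary(target_scene_type: str) -> str:
    """Return the sub-library whose scene index matches the target scene type."""
    for sublibrary, scene_index in SCENE_INDEX_OF_SUBLIBRARY.items():
        if scene_index == target_scene_type:  # successful match
            return sublibrary
    raise LookupError(f"no sub-library for scene type {target_scene_type!r}")
```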
And step 103, searching M pieces of first scene data matched with the description statement from the target sub-library, wherein M is an integer greater than 1.
After the target sub-library of the target scene type is determined, a plurality of pieces of first scene data matched with the description sentence can be retrieved from the target sub-library, where "matched" may mean having a high similarity to the description sentence or containing keywords of the description sentence. The number M of matching pieces of scene data to be searched can be determined according to a preset number of results to be returned; to improve the matching accuracy, M may be a multiple of the preset number of returned results. For example, if the number of scene data pieces to be returned is 10, M may be 3 times 10, i.e., 30 pieces of matching first scene data need to be found.
Optionally, the step 103 includes:
matching the description sentences with index information of each piece of scene data in the target sub-library, and determining matching scores of each piece of scene data in the target sub-library and the description sentences;
and selecting the first M pieces of first scene data with the largest matching score with the description sentences in the target sub-library.
In one embodiment, to facilitate matching, index information may be created for each piece of scene data stored in the sub-library; the index information may be keywords in each piece of scene data, or words with a high frequency of occurrence. In this way, when scene data matched with the description sentence is queried, the description sentence may be matched with the index information of each piece of scene data in the target sub-library, and the matching degree of each piece of scene data in the target sub-library with the description sentence may be determined. A plurality of pieces of first scene data matched with the description sentence, for example K pieces whose matching degree is higher than 60%, can then be determined according to the matching degree, K being the number of matched pieces of first scene data, and the matching score of each of the K pieces can be recorded, where the matching score may be represented by the matching degree or converted from it.
The K pieces of first scene data can then be sorted in order of matching score from high to low, and the top M pieces are selected as the first retrieval result, so that the scene data with a higher matching degree to the user's description sentence are screened out, further ensuring the accuracy of the retrieval result.
In the case where the Elasticsearch engine is used to store the scene data, an application programming interface (Application Programming Interface, API) provided by the Elasticsearch engine can be called, the scene description sentence input by the user is used as the query sentence, and a query-then-fetch (query_then_fetch) search mode is adopted: a request is sent to all scene indexes in the scene database according to the query sentence, the matched target sub-library is found, and the document IDs (excluding document content) with high similarity to the query sentence, i.e., the scene data identifications, together with ranking-related information (such as the score of each document), are retrieved from the target sub-library. The documents are then sorted according to the scores returned by the shards in the target sub-library, the first M documents are taken, and the corresponding document contents are fetched from the relevant shards according to the returned document IDs.
For example, the user inputs "how to use the app split-screen feature" and 10 pieces of similar scene data are to be returned. The Elasticsearch engine queries each scene data title stored in the information technology index base according to the description sentence input by the user and returns scene data title information ordered from high to low similarity to the query sentence (the number of returned pieces is determined by the size of the index base), such as "a video on how to use the app split-screen feature", "a tutorial on the app split-screen feature", "a hands-on guide to the app split-screen feature", "understand the app split-screen feature in one read", and so on; the first 30 pieces of similar scene data information, including scene data title, source, and content, can then be extracted.
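A sketch of this first-stage retrieval with the elasticsearch-py client; M = 30 follows the example above, and since the exact query form is not specified in the embodiment, the multi_match query and the field weighting are assumptions:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

M = 30  # three times the 10 results to be returned
description = "how to use the app split-screen feature"

resp = es.search(
    index="infotech",  # target sub-library locked by the scene type
    query={"multi_match": {
        "query": description,
        "fields": ["title^2", "content"],  # weight titles higher (assumption)
    }},
    size=M,
    search_type="query_then_fetch",
)

first_results = [
    {"id": hit["_id"], "score": hit["_score"], **hit["_source"]}
    for hit in resp["hits"]["hits"]
]
```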
Thus, with this embodiment, the description sentence can be quickly and accurately matched according to the index information of the scene data.
Step 104, calculating the text similarity between the first scene data and the description sentence according to each piece of the first scene data.
In order to obtain a more accurate matching result, a text similarity algorithm may be used to calculate the text similarity between each piece of first scene data in the M pieces of first scene data and the description sentence, for example, word embedding (embedding) may be performed on each piece of first scene data and the description sentence to obtain a corresponding vector, and then similarity between the vector corresponding to each piece of first scene data and the vector corresponding to the description sentence may be compared to obtain the text similarity between each piece of first scene data and the description sentence.
And 105, displaying the first N pieces of first scene data with the maximum text similarity with the descriptive statement in the M pieces of first scene data, wherein N is a positive integer smaller than M.
After the text similarity between each piece of first scene data in the M pieces of first scene data and the description sentence is determined, the M pieces of first scene data are arranged in order of text similarity from high to low, the N pieces of first scene data ranked highest in text similarity are selected as the second retrieval result, and the result is returned to the interface on which the user input the description sentence, for example the retrieval interface, i.e., the N pieces of first scene data are displayed in the retrieval result display area of the retrieval interface.
Optionally, the step 104 includes:
extracting features of the first scene data to obtain a first sentence characterization vector;
extracting features of the description sentences to obtain second sentence characterization vectors;
pooling the first sentence characterization vector to obtain a first sentence vector;
pooling the second sentence characterization vector to obtain a second sentence vector;
and calculating cosine similarity of the first sentence vector and the second sentence vector as text similarity of the first scene data and the descriptive sentence.
In one embodiment, in order to calculate the text similarity between each piece of first scene data in the M pieces of first scene data and the description sentence, feature extraction may be performed on each piece of first scene data in the M pieces of first scene data to obtain a corresponding sentence representation vector, and feature extraction may be performed on the description sentence to obtain a corresponding sentence representation vector. For example, each piece of first scene data and the description sentence may be input into two Bert models for feature extraction, so as to obtain a sentence representation vector corresponding to each piece of first scene data output by the Bert model, and a sentence representation vector corresponding to the description sentence.
Considering that the text length of each piece of first scene data is inconsistent with that of the description sentence, a pooling strategy can be adopted to further extract features from the sentence characterization vectors corresponding to each piece of first scene data and to the description sentence. A mean pooling strategy can be adopted to average the token characterization vectors output by the Bert model, and the resulting mean vector is used as the sentence vector of the whole sentence.
And finally, respectively calculating cosine similarity for the sentence vector of each piece of first scene data and the sentence vector of the description sentence to obtain the text similarity of each piece of first scene data and the description sentence.
In some embodiments, a Sentence-Bert model may be used to perform the text similarity calculation between each piece of first scene data and the description sentence. The Sentence-Bert model derives from the twin network (Siamese Network) architecture, and its main purpose is to measure the similarity between two pieces of input text; its structure is shown in fig. 2. For example, the description sentence input by the user, "how to use the app split-screen feature", and one of the 30 pieces of similar scene data returned by the Elasticsearch search engine may be input; the two sentences are processed by Bert and pooling, the corresponding sentence vectors are output, and the cosine similarity of the two sentence vectors is calculated and recorded. This is repeated 30 times in total, after which the 10 pieces of scene data with the highest similarity are filtered out and returned according to the recorded similarity results.
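A sketch of this second-stage re-ranking, assuming the sentence-transformers package; the checkpoint name is a placeholder for whatever Sentence-Bert model is used, and `first_results` is the M-item first-stage result from the retrieval sketch above:

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder checkpoint; any Sentence-Bert model with mean pooling fits here.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

description = "how to use the app split-screen feature"
candidates = [r["title"] for r in first_results]  # M first-stage hits

# encode() performs the Bert encoding plus mean pooling; cosine similarity
# between the query vector and each candidate vector is the text similarity.
q_vec = model.encode(description, convert_to_tensor=True)
c_vecs = model.encode(candidates, convert_to_tensor=True)
sims = util.cos_sim(q_vec, c_vecs)[0]

N = 10
top = sims.argsort(descending=True)[:N]
final_results = [first_results[int(i)] for i in top]
```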
In this way, by calculating the text similarity between each piece of first scene data and the description sentence according to the embodiment, the relation between sentences can be captured better, and the semantic similarity of the two sentences can be obtained, so that the similarity calculation result is more accurate.
In combination with the above embodiments, the flow of the hybrid scene matching method based on the dual search mechanism provided by the embodiment of the present application may be as shown in fig. 3, namely: step one, acquiring scene data; step two, analyzing and processing the scene data and judging the scene type; if it is a single scene, the distributed search engine Elasticsearch is directly adopted to store the single-scene data; if it is a multi-scene mixture, a Fasttext model is first used to classify the scene data, and the distributed search engine Elasticsearch is then adopted to construct different scene index structures to store the data; step three, storing the data based on the Elasticsearch distributed search engine; step four, acquiring the scene description sentence input by the user; step five, calling the API provided by the Elasticsearch engine and obtaining a first search result through Elasticsearch full-text retrieval according to the description sentence input by the user; and step six, analyzing and processing the first search result with a Sentence-Bert model, and returning the final matching result by comparing the similarity of the description sentence and the first search results.
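Putting the steps of fig. 3 together, a condensed end-to-end sketch that reuses the `scene_category`, `es`, `model`, and `util` names from the snippets above (all of them assumptions for illustration, not the embodiment's actual code):

```python
def match_scene_data(description: str, n: int = 10, m_factor: int = 3) -> list[dict]:
    """Dual search: lock the sub-library, retrieve top M, re-rank to top N."""
    scene_type = scene_category(description)           # scene type of the query
    resp = es.search(                                  # first search: top M hits
        index=scene_type,                              # target sub-library
        query={"multi_match": {"query": description,
                               "fields": ["title^2", "content"]}},
        size=n * m_factor,
    )
    hits = resp["hits"]["hits"]
    q = model.encode(description, convert_to_tensor=True)
    c = model.encode([h["_source"]["title"] for h in hits], convert_to_tensor=True)
    sims = util.cos_sim(q, c)[0]                       # second search: similarity
    order = sims.argsort(descending=True)[:n]
    return [hits[int(i)]["_source"] for i in order]    # top N to display
```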
According to the embodiment of the application, a dual search mechanism is introduced: the Elasticsearch engine is used to acquire a first search result, analysis is performed based on a Sentence-Bert model, and the final matching result is returned by comparing the similarity between the user's description sentence and the first search result, so that long sentence descriptions can be accommodated and the scene matching precision is improved. A matching method for both single-scene and multi-scene data is provided, which improves the expandability of scenes, adapts to multiple fields, and improves user satisfaction. The Sentence-Bert model, which better captures the relation between sentences, is introduced, improving both the inference speed and the matching accuracy.
According to the data processing method, a description sentence input by a user is acquired; a target sub-library corresponding to the target scene type to which the description sentence belongs is determined from a pre-established scene database; M pieces of first scene data matched with the description sentence are searched from the target sub-library, wherein M is an integer greater than 1; for each piece of the first scene data, the text similarity between the first scene data and the description sentence is calculated; and the first N pieces of first scene data with the maximum text similarity to the description sentence among the M pieces of first scene data are displayed, wherein N is a positive integer smaller than M. In this way, the matching sub-library is locked through the scene data type, and the matched scene data is determined by a dual search mechanism combined with text similarity calculation, so that the relation between sentences can be captured well and the matching accuracy is improved.
The embodiment of the application also provides a data processing device. Referring to fig. 4, fig. 4 is a block diagram of a data processing apparatus according to an embodiment of the present application. Since the principle of the data processing apparatus for solving the problem is similar to that of the data processing method in the embodiment of the present application, the implementation of the data processing apparatus may refer to the implementation of the method, and the repetition is not repeated.
As shown in fig. 4, the data processing apparatus 400 includes:
a first obtaining module 401, configured to obtain a description sentence input by a user;
a determining module 402, configured to determine, from a pre-established scene database, a target sub-library corresponding to a target scene type to which the description sentence belongs;
a searching module 403, configured to search M pieces of first scene data matched with the description sentence from the target sub-library, where M is an integer greater than 1;
a calculating module 404, configured to calculate, for each piece of the first scene data, a text similarity between the first scene data and the description sentence;
and a display module 405, configured to display the first N pieces of first scene data with the maximum text similarity with the description sentence in the M pieces of first scene data, where N is a positive integer less than M.
Optionally, the search module 403 includes:
the determining unit is used for matching the description sentence with index information of each piece of scene data in the target sub-library and determining a matching score of each piece of scene data in the target sub-library and the description sentence;
and the selecting unit is used for selecting the first M pieces of first scene data with the largest matching score with the description sentence in the target sub-library.
Optionally, the computing module 404 includes:
the first feature extraction unit is used for carrying out feature extraction on the first scene data to obtain a first sentence characterization vector, wherein the first scene data is any one piece of scene data in the M pieces of scene data;
the second feature extraction unit is used for carrying out feature extraction on the description sentence to obtain a second sentence characterization vector;
the first pooling processing unit is used for pooling the first sentence characterization vector to obtain a first sentence vector;
the second pooling processing unit is used for pooling the second sentence characterization vector to obtain a second sentence vector;
and the calculating unit is used for calculating cosine similarity of the first sentence vector and the second sentence vector and taking the cosine similarity as text similarity of the first scene data and the descriptive sentence.
Optionally, the data processing apparatus 400 further comprises:
the second acquisition module is used for acquiring scene data to be stored;
and the storage module is used for storing the scene data to be stored into a sub-library of scene categories to which the scene data to be stored belong.
Optionally, the storage module is configured to store the to-be-stored scene data into a sub-library of a first scene category according to the first scene category to which the to-be-stored scene data belongs, when it is determined that each piece of the to-be-stored scene data belongs to a single scene data.
Optionally, the data processing apparatus 400 further comprises:
the classification module is used for classifying the scene data to be stored to obtain P-type scene data under the condition that each piece of scene data in the scene data to be stored is determined to belong to multi-scene data;
the storage module is used for respectively storing the P-type scene data into sub-libraries corresponding to P scene categories, wherein the sub-library of each scene category stores the scene data corresponding to the scene category, and P is an integer greater than 1.
Optionally, the storage module includes:
the generating unit is used for segmenting second scene data to generate an index file of the second scene data, wherein the index file is recorded with identification of the second scene data, words and word frequency information in the second scene data, and the second scene data is any piece of scene data in the scene data to be stored;
and the storage unit is used for storing the second scene data and the index file in a sub-library of the scene category to which the second scene data belongs.
Optionally, each sub-library in the scene database is provided with a scene index; the determining module 402 is configured to match the target scene type with a scene index of each sub-library in the scene database, and determine a sub-library in which the scene index is successfully matched with the target scene type as the target sub-library.
The data processing apparatus 400 provided in the embodiment of the present application may execute the above method embodiment, and its implementation principle and technical effects are similar, and this embodiment will not be described herein.
The data processing device 400 of the embodiment of the application acquires the description sentence input by the user; determines, from a pre-established scene database, a target sub-library corresponding to the target scene type to which the description sentence belongs; searches the target sub-library for M pieces of first scene data matched with the description sentence, wherein M is an integer greater than 1; calculates, for each piece of the first scene data, the text similarity between the first scene data and the description sentence; and displays the first N pieces of first scene data with the maximum text similarity to the description sentence among the M pieces, wherein N is a positive integer smaller than M. In this way, the matching sub-library is locked through the scene data type, and the matched scene data is determined by a dual search mechanism combined with text similarity calculation, so that the relation between sentences can be captured well and the matching accuracy is improved.
The embodiment of the application also provides electronic equipment. Because the principle of solving the problem of the electronic device is similar to that of the data processing method in the embodiment of the present application, the implementation of the electronic device may refer to the implementation of the method, and the repetition is not repeated. As shown in fig. 5, a terminal according to an embodiment of the present application includes:
the processor 500, configured to read the program in the memory 520, performs the following procedures:
acquiring a description sentence input by a user;
determining a target sub-library corresponding to the target scene type to which the description statement belongs from a pre-established scene database;
searching M pieces of first scene data matched with the description statement from the target sub-library, wherein M is an integer greater than 1;
calculating the text similarity of the first scene data and the description statement aiming at each piece of the first scene data;
and displaying the first N pieces of first scene data with the maximum text similarity with the descriptive statement in the M pieces of first scene data, wherein N is a positive integer smaller than M.
Wherein in fig. 5, a bus architecture may comprise any number of interconnected buses and bridges, and in particular one or more processors represented by processor 500 and various circuits of memory represented by memory 520, linked together. The bus architecture may also link together various other circuits such as peripheral devices, voltage regulators, power management circuits, etc., which are well known in the art and, therefore, will not be described further herein. The bus interface provides an interface. The processor 500 is responsible for managing the bus architecture and general processing, and the memory 520 may store data used by the processor 500 in performing operations.
Optionally, the processor 500 is further configured to read the program in the memory 520, and perform the following steps:
matching the description sentences with index information of each piece of scene data in the target sub-library, and determining matching scores of each piece of scene data in the target sub-library and the description sentences;
and selecting the first M pieces of first scene data with the largest matching score with the description sentences in the target sub-library.
Optionally, the processor 500 is further configured to read the program in the memory 520, and perform the following steps:
extracting features of the first scene data to obtain a first sentence characterization vector;
extracting features of the description sentences to obtain second sentence characterization vectors;
pooling the first sentence characterization vector to obtain a first sentence vector;
pooling the second sentence characterization vector to obtain a second sentence vector;
and calculating the cosine similarity of the first sentence vector and the second sentence vector as the text similarity between the first scene data and the description statement.
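A minimal sketch of this similarity computation, assuming mean pooling over token characterization vectors; the embodiment does not fix the feature extractor or the pooling operation, and the toy arrays below merely stand in for real extractor output:

import numpy as np

def mean_pool(token_vectors):
    # Pool a sequence of token characterization vectors into one sentence vector
    return np.mean(token_vectors, axis=0)

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

first_scene_tokens = np.array([[0.2, 0.7], [0.4, 0.5]])   # first scene data
description_tokens = np.array([[0.3, 0.6], [0.5, 0.4]])   # description statement

print(cosine_similarity(mean_pool(first_scene_tokens),
                        mean_pool(description_tokens)))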
Optionally, the processor 500 is further configured to read the program in the memory 520, and perform the following steps:
acquiring scene data to be stored;
and storing the scene data to be stored into a sub-library of the scene category to which the scene data to be stored belongs.
Optionally, the processor 500 is further configured to read the program in the memory 520, and perform the following steps:
under the condition that each piece of the scene data to be stored belongs to single-scene data, storing the scene data to be stored into a sub-library of a first scene category according to the first scene category to which the scene data to be stored belongs;
and under the condition that each piece of the scene data to be stored belongs to multi-scene data, classifying the scene data to be stored to obtain P types of scene data, and storing the P types of scene data into the sub-libraries corresponding to P scene categories respectively, wherein the sub-library of each scene category stores the scene data corresponding to that scene category, and P is an integer greater than 1.
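A rough sketch of this routing logic, where classify is a hypothetical function mapping one piece of scene data to its scene category; the embodiment does not fix how the category is determined:

def store_scene_data(batch, scene_database, classify):
    categories = {classify(record) for record in batch}
    if len(categories) == 1:
        # Single-scene data: the whole batch goes into one sub-library
        scene_database.setdefault(categories.pop(), []).extend(batch)
    else:
        # Multi-scene data: split into P per-category groups, each stored
        # in the sub-library of its own scene category
        for record in batch:
            scene_database.setdefault(classify(record), []).append(record)

# Toy usage:
db = {}
store_scene_data(["base station alarm", "traffic surge"], db,
                 classify=lambda r: "fault" if "alarm" in r else "traffic")
print(sorted(db))  # ['fault', 'traffic']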
Optionally, the processor 500 is further configured to read the program in the memory 520, and perform the following steps:
performing word segmentation on second scene data to generate an index file of the second scene data, wherein the index file records an identifier of the second scene data together with the words and word frequency information in the second scene data, and the second scene data is any piece of the scene data to be stored;
and storing the second scene data and the index file in a sub-library of the scene category to which the second scene data belongs.
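For illustration, a minimal index-file builder matching this description; whitespace splitting stands in for a real word-segmentation step, and the dictionary layout is an assumption of the sketch:

from collections import Counter

def build_index_file(scene_id, text):
    words = text.split()  # placeholder for proper word segmentation
    return {"id": scene_id, "word_freq": dict(Counter(words))}

print(build_index_file("scene-42", "network fault network alarm"))
# {'id': 'scene-42', 'word_freq': {'network': 2, 'fault': 1, 'alarm': 1}}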
Optionally, each sub-library in the scene database is provided with a scene index; the processor 500 is also configured to read the program in the memory 520, and perform the following steps:
matching the target scene type with the scene index of each sub-library in the scene database, and determining the sub-library whose scene index successfully matches the target scene type as the target sub-library.
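As a sketch, with exact string equality standing in for the matching rule, which the embodiment does not fix:

def find_target_sublibrary(target_scene_type, scene_database):
    # scene_database maps each sub-library's scene index to the sub-library
    for scene_index, sub_library in scene_database.items():
        if scene_index == target_scene_type:
            return sub_library
    return None

A scene index could equally be a set of keywords matched more loosely; exact equality is only the simplest case.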
The electronic device provided by the embodiment of the present application may execute the above method embodiments; its implementation principle and technical effects are similar, and are not described again here.
Furthermore, a computer readable storage medium of an embodiment of the present application is used to store a computer program, where the computer program may be executed by a processor to implement the steps of the method embodiment shown in fig. 1 or fig. 3.
In the several embodiments provided in the present application, it should be understood that the disclosed methods and apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may be physically included separately, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in hardware plus software functional units.
The integrated units implemented in the form of software functional units described above may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform some of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
While the foregoing is directed to the preferred embodiments of the present application, it will be appreciated by those skilled in the art that various modifications and adaptations can be made without departing from the principles of the present application, and such modifications and adaptations are intended to fall within the scope of the present application.

Claims (10)

1. A method of data processing, comprising:
acquiring a description statement input by a user;
determining, from a pre-established scene database, a target sub-library corresponding to a target scene type to which the description statement belongs;
searching the target sub-library for M pieces of first scene data matching the description statement, wherein M is an integer greater than 1;
calculating, for each piece of first scene data, the text similarity between the first scene data and the description statement;
and displaying, among the M pieces of first scene data, the first N pieces of first scene data with the greatest text similarity to the description statement, wherein N is a positive integer smaller than M.
2. The method of claim 1, wherein the searching the target sub-library for M pieces of first scene data matching the description statement comprises:
matching the description statement with the index information of each piece of scene data in the target sub-library, and determining a matching score between each piece of scene data in the target sub-library and the description statement;
and selecting, from the target sub-library, the first M pieces of first scene data with the largest matching scores with the description statement.
3. The method of claim 1, wherein the calculating the text similarity between the first scene data and the description statement comprises:
extracting features of the first scene data to obtain a first sentence characterization vector;
extracting features of the description statement to obtain a second sentence characterization vector;
pooling the first sentence characterization vector to obtain a first sentence vector;
pooling the second sentence characterization vector to obtain a second sentence vector;
and calculating the cosine similarity of the first sentence vector and the second sentence vector as the text similarity between the first scene data and the description statement.
4. The method of claim 1, wherein before the acquiring the description statement input by the user, the method further comprises:
acquiring scene data to be stored;
and storing the scene data to be stored into a sub-library of the scene category to which the scene data to be stored belongs.
5. The method of claim 4, wherein the storing the scene data to be stored into a sub-library of the scene category to which the scene data to be stored belongs comprises:
under the condition that each piece of the scene data to be stored belongs to single-scene data, storing the scene data to be stored into a sub-library of a first scene category according to the first scene category to which the scene data to be stored belongs;
and under the condition that each piece of the scene data to be stored belongs to multi-scene data, classifying the scene data to be stored to obtain P types of scene data, and storing the P types of scene data into the sub-libraries corresponding to P scene categories respectively, wherein the sub-library of each scene category stores the scene data corresponding to that scene category, and P is an integer greater than 1.
6. The method of claim 4, wherein the storing the scene data to be stored into a sub-library of the scene category to which the scene data to be stored belongs comprises:
performing word segmentation on second scene data to generate an index file of the second scene data, wherein the index file records an identifier of the second scene data together with the words and word frequency information in the second scene data, and the second scene data is any piece of the scene data to be stored;
and storing the second scene data and the index file in a sub-library of the scene category to which the second scene data belongs.
7. The method of claim 1, wherein each sub-library in the scene database is provided with a scene index; and the determining, from a pre-established scene database, a target sub-library corresponding to a target scene type to which the description statement belongs comprises:
matching the target scene type with the scene index of each sub-library in the scene database, and determining the sub-library whose scene index matches the target scene type as the target sub-library.
8. A data processing apparatus, comprising:
the first acquisition module is used for acquiring a description statement input by a user;
the determining module is used for determining, from a pre-established scene database, a target sub-library corresponding to a target scene type to which the description statement belongs;
the searching module is used for searching the target sub-library for M pieces of first scene data matching the description statement, wherein M is an integer greater than 1;
the computing module is used for computing, for each piece of first scene data, the text similarity between the first scene data and the description statement;
and the display module is used for displaying, among the M pieces of first scene data, the first N pieces of first scene data with the greatest text similarity to the description statement, wherein N is a positive integer smaller than M.
9. An electronic device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor; characterized in that the processor is configured to read the program in the memory to implement the steps of the data processing method according to any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps in the data processing method according to any one of claims 1 to 7.
CN202310559630.1A 2023-05-18 2023-05-18 Data processing method, device, electronic equipment and computer readable storage medium Pending CN116910054A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310559630.1A CN116910054A (en) 2023-05-18 2023-05-18 Data processing method, device, electronic equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN116910054A true CN116910054A (en) 2023-10-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination