CN116991980A - Text screening model training method, related method, device, medium and equipment - Google Patents

Text screening model training method, related method, device, medium and equipment

Info

Publication number
CN116991980A
Authority
CN
China
Prior art keywords
text
search
word segmentation
texts
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311255991.3A
Other languages
Chinese (zh)
Other versions
CN116991980B (en)
Inventor
黄淼鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202311255991.3A priority Critical patent/CN116991980B/en
Publication of CN116991980A publication Critical patent/CN116991980A/en
Application granted granted Critical
Publication of CN116991980B publication Critical patent/CN116991980B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3334Selection or weighting of terms from queries, including natural language queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application discloses a text screening model training method and a related method, device, medium and equipment. The method comprises the following steps: obtaining a sample sentence, and performing word segmentation processing on the sample sentence according to at least two granularity levels to obtain sample word segmentation texts respectively corresponding to the at least two granularity levels; combining the sample word segmentation texts respectively corresponding to the at least two granularity levels to obtain a plurality of first search texts; searching the database according to each first search text to obtain a search result corresponding to each first search text, and determining a relevance score between each first search text and the corresponding search result; acquiring input features corresponding to each first search text; and training the basic text screening model according to the input features and the relevance score corresponding to each first search text to obtain a text screening model. Search results retrieved with the search text selected by the text screening model are highly accurate.

Description

Text screening model training method, related method, device, medium and equipment
Technical Field
The application relates to the technical field of computers, in particular to a text screening model training method and a related method, device, medium and equipment.
Background
In some search scenarios, after an input object enters a search sentence, the search engine searches the database for a corresponding search result according to the search sentence, and then presents the search result to the input object.
In the related art, after an input object enters a search sentence into a search engine, the search engine extracts keywords from the search sentence, and then searches the database for a corresponding result according to the keywords.
In the course of research and practice on the related art, the inventor of the present application found that search results obtained by searching the database with keywords alone may be inaccurate, resulting in lower accuracy of text search.
Disclosure of Invention
The embodiment of the application provides a text screening model training method, a related method, a device, a medium and equipment, which can improve the accuracy of text search.
In order to solve the technical problems, the embodiment of the application provides the following technical scheme:
A text screening model training method, comprising:
obtaining a sample sentence, and performing word segmentation processing on the sample sentence according to at least two granularity levels to obtain sample word segmentation texts respectively corresponding to the at least two granularity levels;
combining the sample word segmentation texts corresponding to at least two granularity levels respectively to obtain a plurality of first search texts;
searching the database according to each first search text to obtain a search result corresponding to each first search text, and determining a relevance score between each first search text and the corresponding search result;
acquiring input features corresponding to each first search text;
and training the basic text screening model according to the corresponding input characteristics and the corresponding relevance scores of each first search text to obtain a text screening model.
A text search method, comprising:
obtaining a search sentence, and performing word segmentation processing on the search sentence according to at least two granularity levels to obtain word segmentation texts respectively corresponding to the at least two granularity levels;
combining word segmentation texts corresponding to at least two granularity levels respectively to obtain a plurality of second search texts;
inputting each second search text into a text screening model trained by the text screening model training method provided by the embodiment of the application, and screening out a target search text;
searching the database according to the target search text to obtain a target search result.
A text screening model training apparatus, comprising:
the first word segmentation module is used for acquiring a sample sentence, and carrying out word segmentation processing on the sample sentence according to at least two granularity levels to obtain sample word segmentation texts respectively corresponding to the at least two granularity levels;
the first combination module is used for combining the sample word segmentation texts corresponding to at least two granularity levels respectively to obtain a plurality of first search texts;
the scoring module is used for searching the database according to each first search text to obtain a search result corresponding to each first search text, and determining a relevance score between each first search text and the corresponding search result;
the acquisition module is used for acquiring the input characteristics corresponding to each first search text;
and the training module is used for training the basic text screening model according to the input features and the relevance score corresponding to each first search text to obtain a text screening model.
In some embodiments, the first combining module is configured to:
determining the text obtained by combining sample word segmentation texts corresponding to different granularity levels as a first search text;
and determining the text obtained by combining the sample word segmentation texts corresponding to each single granularity level as a first search text.
In some embodiments, the granularity level includes a first granularity level and a second granularity level, the sample word segmentation text includes a first word segmentation text and a second word segmentation text, the first word segmentation module to:
performing word segmentation processing on the sample sentence according to the first granularity level to obtain a first word segmentation text corresponding to the first granularity level;
and performing word segmentation processing on the sample sentence according to the second granularity level to obtain a second word segmentation text corresponding to the second granularity level.
In some embodiments, the first combining module is configured to: before combining the sample word segmentation texts respectively corresponding to the at least two granularity levels to obtain the plurality of first search texts, determine the semantic sequence corresponding to the sample sentence.
In some embodiments, the first combining module is configured to:
determining the text obtained by combining the first word segmentation texts and the second word segmentation texts according to the semantic sequence as a first search text;
and determining the text obtained by combining the first word segmentation texts according to the semantic sequence, and the text obtained by combining the second word segmentation texts according to the semantic sequence, as first search texts.
In some embodiments, the first combining module is configured to: before combining the sample word segmentation texts respectively corresponding to the at least two granularity levels to obtain the plurality of first search texts, number each first word segmentation text to obtain a first numbering result;
and number each second word segmentation text to obtain a second numbering result.
In some embodiments, the first combining module is configured to:
combining the first word segmentation texts and the second word segmentation texts according to the first numbering result and the second numbering result to obtain a plurality of combined texts;
determining each unambiguous combined text among the plurality of combined texts as a first search text;
combining the first word segmentation texts according to the first numbering result and combining the second word segmentation texts according to the second numbering result to obtain a plurality of combined texts;
and determining each unambiguous combined text among the plurality of combined texts as a first search text.
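The combination of word segmentation texts from two granularity levels can be illustrated with a minimal sketch. It is not the claimed implementation and rests on several assumptions: each segmentation is an ordered token list whose concatenation reproduces the sentence, character offsets stand in for the numbering results, and a combined text counts as "unambiguous" when its tokens cover the sentence exactly once, in order. The function names `spans` and `combine` are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, token)

def spans(tokens: List[str]) -> List[Span]:
    """Number each token by its character offsets within the sentence."""
    out, pos = [], 0
    for tok in tokens:
        out.append((pos, pos + len(tok), tok))
        pos += len(tok)
    return out

def combine(coarse: List[str], fine: List[str]) -> List[List[str]]:
    """Enumerate in-order token combinations that cover the sentence exactly once.

    Assumption: both segmentations concatenate to the same sentence.
    The result includes the pure coarse and pure fine segmentations
    as well as every mixed combination.
    """
    length = sum(len(t) for t in coarse)
    if length != sum(len(t) for t in fine):
        raise ValueError("segmentations must cover the same sentence")
    # A set removes tokens that both granularity levels share.
    candidates = sorted(set(spans(coarse)) | set(spans(fine)))
    results: List[List[str]] = []

    def walk(pos: int, path: List[str]) -> None:
        if pos == length:
            results.append(path)
            return
        for start, end, tok in candidates:
            if start == pos:
                walk(end, path + [tok])

    walk(0, [])
    return results
```

Each returned token list can then be joined into one candidate search text. Enumeration is exponential in the worst case, so a real system would likely prune or cap the candidates.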
In some embodiments, the search result includes a plurality of sub-search results, each corresponding to a degree of correlation; the search result has a search quantity, and each sub-search result has a sub-search quantity; the scoring module is configured to:
acquiring a weight value corresponding to each correlation degree;
according to the weight value corresponding to each correlation degree and the corresponding sub-search quantity, determining a target value corresponding to each correlation degree;
and determining a relevance score between the first search text and the corresponding search result according to the target value and the search quantity corresponding to each relevance degree.
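One way to read the scoring steps above is as a weighted average. The sketch below assumes that the target value for a correlation degree is its weight multiplied by its sub-search quantity, and that the relevance score is the sum of target values divided by the search quantity. The text does not fix this formula, so the division, the degree names, and the weight values are all assumptions made for illustration.

```python
from typing import Dict

def relevance_score(sub_counts: Dict[str, int], weights: Dict[str, float]) -> float:
    """Weighted-average relevance score for one first search text.

    sub_counts: number of results at each correlation degree (sub-search quantity).
    weights:    weight value for each correlation degree (hypothetical values).
    Assumption: the search quantity equals the sum of the sub-search quantities.
    """
    search_quantity = sum(sub_counts.values())
    if search_quantity == 0:
        return 0.0  # no results retrieved, so nothing relevant was found
    # Target value per correlation degree: weight times sub-search quantity.
    target_values = {deg: weights[deg] * n for deg, n in sub_counts.items()}
    return sum(target_values.values()) / search_quantity
```

For instance, with hypothetical weights of 1.0, 0.5 and 0.0 for high, medium and low correlation, a result set of 2 high, 1 medium and 1 low item scores (2 + 0.5 + 0) / 4 = 0.625.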
In some embodiments, the obtaining module is configured to:
extracting sentence characteristics corresponding to sample sentences;
extracting word segmentation characteristics corresponding to a sample word segmentation text in the first search text;
extracting statistical characteristics of a sample word segmentation text relative to a sample sentence in a first search text;
and determining the sentence characteristics, the word segmentation characteristics and the statistical characteristics as input characteristics corresponding to the first search text.
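The three feature groups can be pictured with a small sketch. The concrete features below (sentence length, token count, mean token length, and the fraction of the sentence covered by the tokens) are hypothetical stand-ins chosen for illustration; the text above names the groups but not their contents.

```python
from typing import List

def input_features(sentence: str, tokens: List[str]) -> List[float]:
    """Build an input feature vector for one first search text.

    All concrete features here are illustrative assumptions.
    """
    if not sentence or not tokens:
        raise ValueError("sentence and tokens must be non-empty")
    token_chars = sum(len(t) for t in tokens)
    sentence_feats = [float(len(sentence))]                      # sentence feature
    token_feats = [float(len(tokens)), token_chars / len(tokens)]  # word segmentation features
    stat_feats = [token_chars / len(sentence)]                   # statistic relative to the sentence
    return sentence_feats + token_feats + stat_feats
```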
In some embodiments, the training module is to:
inputting the input features corresponding to the first search text into a basic text screening model to obtain an output value;
determining a loss value between the output value and the correlation score according to a preset loss function;
and if the loss value meets the preset loss condition, determining that training of the basic text screening model is complete, thereby obtaining the text screening model.
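The training loop described above can be sketched in plain Python. This sketch assumes a linear model as the basic text screening model, mean-square error as the preset loss function, and "loss at or below a threshold" as the preset loss condition; none of these choices is confirmed by the text, and the names `train`, `lr`, and `loss_threshold` are hypothetical.

```python
from typing import List, Sequence, Tuple

def mse(preds: Sequence[float], labels: Sequence[float]) -> float:
    """Mean-square error between output values and relevance scores."""
    return sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(labels)

def train(
    features: List[Sequence[float]],
    scores: List[float],
    lr: float = 0.1,
    loss_threshold: float = 1e-4,
    max_epochs: int = 10000,
) -> Tuple[List[float], float, float]:
    """Fit a linear model by gradient descent until the loss condition is met."""
    n, d = len(features), len(features[0])
    w, b = [0.0] * d, 0.0
    loss = float("inf")
    for _ in range(max_epochs):
        # Output value for each first search text.
        preds = [sum(wi * xi for wi, xi in zip(w, x)) + b for x in features]
        loss = mse(preds, scores)
        if loss <= loss_threshold:  # assumed form of the preset loss condition
            break
        errs = [p - y for p, y in zip(preds, scores)]
        # Gradient step on each weight and on the bias.
        for j in range(d):
            w[j] -= lr * 2 / n * sum(e * x[j] for e, x in zip(errs, features))
        b -= lr * 2 / n * sum(errs)
    return w, b, loss
```

The relevance scores act as labels and the input features as samples. A production system would more plausibly use a tree-based or neural regressor, but the control flow (predict, compute loss, check the condition, update) stays the same.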
A text search device comprising:
The second word segmentation module is used for acquiring a search sentence, and carrying out word segmentation processing on the search sentence according to at least two granularity levels to obtain word segmentation texts corresponding to the at least two granularity levels respectively;
the second combination module is used for combining word segmentation texts corresponding to at least two granularity levels respectively to obtain a plurality of second search texts;
the screening module is used for inputting each second search text into a text screening model trained according to the text screening model training method, and screening out target search texts;
and the searching module is used for searching the database according to the target searching text to obtain a target searching result.
In some embodiments, the screening module is to:
inputting each second search text into a text screening model trained by the text screening model training method, and outputting an output value corresponding to each second search text;
and determining the second search text with the highest output value from the plurality of second search texts as a target search text.
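Selecting the target search text amounts to taking the candidate with the highest model output. The sketch below assumes the trained model and the feature extractor are available as plain callables; `select_target`, `featurize`, and `model` are hypothetical names.

```python
from typing import Callable, List, Sequence

def select_target(
    candidates: List[str],
    featurize: Callable[[str], Sequence[float]],
    model: Callable[[Sequence[float]], float],
) -> str:
    """Return the second search text whose output value is highest."""
    if not candidates:
        raise ValueError("at least one second search text is required")
    return max(candidates, key=lambda text: model(featurize(text)))
```

When several candidates tie, `max` returns the first one encountered, so candidate order acts as the tie-breaker.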
A computer readable storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps of the text screening model training method or the steps of the text search method described above.
A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing steps in the text screening model training method or steps in the text searching method when the computer program is executed.
A computer program product or computer program comprising computer instructions stored in a storage medium. The processor of the computer device reads the computer instructions from the storage medium and the processor executes the computer instructions such that the steps in the text screening model training method described above or the steps in the text search method described above are implemented.
In the embodiment of the application, a sample sentence is obtained, and word segmentation processing is carried out on the sample sentence according to at least two granularity levels, so that sample word segmentation texts respectively corresponding to the at least two granularity levels are obtained; the sample word segmentation texts respectively corresponding to the at least two granularity levels are combined to obtain a plurality of first search texts; the database is searched according to each first search text to obtain a search result corresponding to each first search text, and a relevance score between each first search text and the corresponding search result is determined; input features corresponding to each first search text are acquired; and the basic text screening model is trained according to the input features and the relevance score corresponding to each first search text to obtain a text screening model. In other words, the sample sentence is segmented at different granularity levels to obtain word segmentation texts at a plurality of granularity levels, and the sample word segmentation texts of the at least two granularity levels are combined to obtain a plurality of first search texts. Each first search text is used to search the database to obtain its corresponding search result, and the relevance score between each first search text and its search result is then determined. The input features of each first search text are obtained and, with the relevance scores as labels and the input features as samples, the basic text screening model is trained to obtain the text screening model. The text screening model can identify the search text most suitable for searching, and the database is then searched with that search text to obtain the search results. Compared with the search results obtained by searching only with keywords in the related art, the search results obtained by searching with the target search text screened by the text screening model are more accurate.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application. The objectives and other advantages of the application will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic view of a text search scenario provided by an embodiment of the present application;
FIG. 2 is a schematic flow chart of a text screening model training method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a data flow of text screening model training according to an embodiment of the present application;
FIG. 4 is a schematic view of a word segmentation combination provided by an embodiment of the present application;
FIG. 5 is another data flow diagram of text screening model training provided by an embodiment of the present application;
FIG. 6 is another flow chart of a text screening model training method according to an embodiment of the present application;
FIG. 7 is a schematic flow chart of a text search method according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a data flow of text searching provided by an embodiment of the present application;
FIG. 9 is another schematic view of a text search provided by an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a text screening model training device according to an embodiment of the present application;
FIG. 11 is a schematic structural diagram of a text search device according to an embodiment of the present application;
FIG. 12 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order that those skilled in the art will better understand the solution of the present application, a technical solution of an embodiment of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiment of the present application, and it is apparent that the described embodiment is only a part of the embodiment of the present application, not all the embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to fall within the scope of the application.
It should be noted that, in some of the processes described in the specification, claims and drawings above, a plurality of steps appearing in a particular order are included, but it should be clearly understood that the steps may be performed out of order or performed in parallel, the step numbers are merely used to distinguish between the different steps, and the numbers themselves do not represent any order of execution. Furthermore, the description of "first," "second," or "object" and the like herein is for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.
Before describing embodiments of the present application in further detail, the terms involved in the embodiments of the present application are explained as follows:
Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of sensing, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, machine learning/deep learning and other directions. The embodiment of the application relates to natural language processing technology.
Natural language processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between a person and a computer in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Research in this field involves natural language, that is, the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graph techniques, and the like.
Granularity level: the higher the granularity level, the finer the granularity and the more aggressively the sentence is segmented, so more segmented words are obtained and each one is more specific; for example, one segmented word may contain a single word. The lower the granularity level, the coarser the granularity and the less aggressively the sentence is segmented, so fewer segmented words are obtained and each one is broader; for example, one segmented word may contain a plurality of words.
Mean-square error (MSE): a measure of the degree of difference between an estimator and the quantity being estimated.
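For reference, the standard definition of mean-square error over n samples is given below. Reading \hat{y}_i as the model's output value for the i-th first search text and y_i as its relevance score is an assumption about how the term applies to the training described in this application.

```latex
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2
```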
In some search scenarios, after an input object enters a search sentence, the search engine searches the database for a corresponding search result according to the search sentence, and then presents the search result to the input object.
In the related art, after an input object enters a search sentence into a search engine, the search engine extracts keywords from the search sentence, and then searches the database for a corresponding result according to the keywords.
In the course of research and practice on the related art, the inventor of the present application found that search results obtained by searching the database with keywords alone may be inaccurate, resulting in lower accuracy of text search.
In order to solve this technical problem, the embodiment of the application provides a text screening model training method, a related method, a device, a medium and equipment, which can improve the accuracy of text search.
Referring to fig. 1, fig. 1 is a schematic view of a text search scenario according to an embodiment of the present application.
As shown in FIG. 1, device 1, device 2, and device 3 may be understood as terminal devices, including but not limited to desktop computers, mobile phones, notebook computers, tablet computers, smart wearable devices, and the like. An object using a terminal device can enter a search sentence locally through any of various input modes, such as voice input, keyboard input, gesture input, image input, and body language input. The terminal device converts the input information into text information, that is, into the search sentence, and then inputs the search sentence into a search engine.
The search engine sends the search sentence to the server through the network. After receiving the search sentence, the server performs word segmentation processing on the search sentence according to at least two granularity levels to obtain word segmentation texts respectively corresponding to the at least two granularity levels; combines the word segmentation texts respectively corresponding to the at least two granularity levels to obtain a plurality of second search texts; inputs each second search text into a text screening model trained by the text screening model training method provided by the embodiment of the application, and screens out a target search text; and finally searches the database according to the target search text to obtain a target search result.
For example, the server searches a local database, or a database accessible over the network, for the corresponding search result according to the target search text, and then sends the search result to the terminal device, so that the search result is presented to the object using the terminal device.
Word segmentation processing is performed on the search sentence according to at least two granularity levels, and the word segmentation texts respectively corresponding to the at least two granularity levels are then combined to obtain a plurality of second search texts. The plurality of second search texts is then screened by the text screening model to obtain the target search text. The target search text is better suited for the server to search the database, so the search result found in the database is more accurate, for example, articles that better match the search sentence are found. Compared with the related art, in which only keywords are used for searching, the search results obtained by searching with the text screened by the text screening model are more accurate, and the accuracy of searching is therefore improved.
For a more detailed understanding of the text screening model training method provided by the embodiment of the present application, please refer to FIG. 2, which is a flow chart of the text screening model training method provided by the embodiment of the present application. The method may be executed by a server. The text screening model training method can comprise the following steps:
In step 110, a sample sentence is obtained, and word segmentation processing is performed on the sample sentence according to at least two granularity levels, so as to obtain sample word segmentation texts respectively corresponding to the at least two granularity levels.
In some embodiments, word segmentation processing may be performed on the search sentence according to a single default granularity level to obtain the word segmentation text corresponding to that granularity level. This word segmentation text may be used directly to generate a search text, and the search text is then used to search the database to obtain a search result.
However, when only one granularity level is used for word segmentation, the search text generated from the resulting word segmentation text is not general enough. For some data in the database, matches may be missed, so the final search result is incomplete. For example, when searching for articles, searching with such a search text may fail to find some relevant articles.
In view of the problem, in the embodiment of the application, in the process of searching, word segmentation processing is performed on an input sentence according to at least two granularity levels to obtain word segmentation texts corresponding to each granularity level, and then the word segmentation texts corresponding to different granularity levels are combined to obtain a plurality of search texts. The text screening model is utilized to screen out the target search text which is most suitable for searching from a plurality of search texts, and the result searched in the database by utilizing the target search text is more accurate and comprehensive.
Training the text screening model is also required before the text screening model is used. Therefore, in the application, a basic text screening model is firstly set, and then the basic text screening model is trained, so that the text screening model is obtained.
Before the basic text screening model is trained, training samples for it must be acquired. For example, a sample sentence may be acquired and used as a training sample. The sample sentence may be a historical search sentence of the search engine, that is, a search sentence input into the search engine at some past time, such as "a museum in a certain area", "an observatory in a certain urban area", or "the principles of quantum mechanics and the founder of quantum mechanics".
After the sample sentence is obtained, word segmentation may be performed on it according to at least two granularity levels. The higher the granularity level, the finer the granularity and the more aggressively the sentence is split: more segments are obtained, and each segment is more specific and detailed, for example a segment may contain a single word. The lower the granularity level, the coarser the granularity and the less aggressively the sentence is split: fewer segments are obtained, and each segment is more general, for example a segment may contain several words.
For example, suppose the granularity levels include level 1 and level 5. When the sample sentence is "A Dijia certain area teppanyaki" and it is segmented at granularity level 1, three sample word segmentation texts are obtained: "A Dijia certain", "area", and "teppanyaki".
When the same sample sentence is segmented at granularity level 5, five sample word segmentation texts are obtained: "A Di", "Jia certain", "area", "iron plate", and "roast".
It should be noted that, in the actual word segmentation process of the sample sentence, two or more different granularity levels may be adopted to perform word segmentation process on the sample sentence, so as to obtain a sample word segmentation text corresponding to each granularity level.
In some embodiments, the granularity levels include a first granularity level and a second granularity level, and performing word segmentation on the sample sentence according to at least two granularity levels to obtain the sample word segmentation texts corresponding to each of the at least two granularity levels includes:
(1.1) performing word segmentation on the sample sentence according to the first granularity level to obtain a first word segmentation text corresponding to the first granularity level;
(1.2) performing word segmentation on the sample sentence according to the second granularity level to obtain a second word segmentation text corresponding to the second granularity level.
Segmenting the sample sentence at different granularity levels yields word segmentation texts for each level, and combining the word segmentation texts of the different levels yields first search texts that contain word segmentation texts from several granularity levels. This enriches the word segmentation texts available in the first search texts, makes the first search texts better suited to searching, and improves the accuracy of the search results in subsequent searches.
The first granularity level is higher than the second granularity level; that is, the first granularity level corresponds to fine granularity and the second granularity level corresponds to coarse granularity.
Word segmentation is performed on the sample sentence according to the first granularity level to obtain the first word segmentation texts corresponding to the first granularity level. For example, when the sample sentence is "A Dijia certain area teppanyaki", segmenting it at the first granularity level yields five first word segmentation texts: "A Di", "Jia certain", "area", "iron plate", and "roast".
Word segmentation is performed on the sample sentence according to the second granularity level to obtain the second word segmentation texts corresponding to the second granularity level. For example, when the sample sentence is "A Dijia certain area teppanyaki", segmenting it at the second granularity level yields three second word segmentation texts: "A Dijia certain", "area", and "teppanyaki".
For a more detailed understanding of the word segmentation of the sample sentence, refer to fig. 3, which is a schematic diagram of the data flow of text screening model training according to an embodiment of the present application.
As shown in fig. 3, there are a word segmenter 1 and a word segmenter 2; word segmenter 1 may be configured with the first granularity level and word segmenter 2 with the second granularity level. After the sample sentence is acquired, it is input to word segmenter 1 and word segmenter 2 separately. Word segmenter 1 segments the sample sentence according to the first granularity level to obtain the first word segmentation texts, and word segmenter 2 segments it according to the second granularity level to obtain the second word segmentation texts. The sample word segmentation texts comprise the first word segmentation texts and the second word segmentation texts.
It should be noted that the embodiments of the present application are not limited to segmenting the sample sentence at the first and second granularity levels. Other granularity levels may also be used. For example, a third granularity level may be used to obtain third word segmentation texts, where the third granularity level is lower than both the first granularity level and the second granularity level.
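The data flow of fig. 3 can be sketched as follows. This is only an illustrative sketch in Python, not the segmenters of the application: the names `make_lookup_segmenter` and `segment_at_levels` are hypothetical, and each toy segmenter simply returns a precomputed segmentation for the sample sentence instead of running a real word segmentation algorithm.

```python
from typing import Callable, Dict, List

Segmenter = Callable[[str], List[str]]

SAMPLE = "A Dijia certain area teppanyaki"


def make_lookup_segmenter(table: Dict[str, List[str]]) -> Segmenter:
    """Hypothetical stand-in for a real segmenter: looks up a fixed result."""
    def segment(sentence: str) -> List[str]:
        return list(table[sentence])
    return segment


def segment_at_levels(sentence: str,
                      segmenters: Dict[str, Segmenter]) -> Dict[str, List[str]]:
    """Run the same sentence through one segmenter per granularity level."""
    return {level: seg(sentence) for level, seg in segmenters.items()}


# Word segmenter 1: first (fine) granularity level.
# Word segmenter 2: second (coarse) granularity level.
segmenters = {
    "first": make_lookup_segmenter(
        {SAMPLE: ["A Di", "Jia certain", "area", "iron plate", "roast"]}),
    "second": make_lookup_segmenter(
        {SAMPLE: ["A Dijia certain", "area", "teppanyaki"]}),
}

texts = segment_at_levels(SAMPLE, segmenters)
```

A third granularity level would be supported by adding one more entry to `segmenters`; nothing else in the sketch depends on there being exactly two levels.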
In step 120, the sample word segmentation texts corresponding to at least two granularity levels are combined to obtain a plurality of first search texts.
It can be understood that after the sample word segmentation text corresponding to each granularity level is obtained, there may be multiple combinations, for example, the sample word segmentation texts corresponding to different granularity levels are mixed and combined, or the sample word segmentation texts corresponding to each granularity level are separately combined, so as to obtain multiple first search texts.
In some embodiments, combining the sample word segmentation texts corresponding to the at least two granularity levels to obtain a plurality of first search texts includes:
(1.1) determining a text obtained by combining sample word segmentation texts across different granularity levels as a first search text;
(1.2) determining a text obtained by combining the sample word segmentation texts within each granularity level as a first search text.
Because sample word segmentation texts exist for several granularity levels, there are generally two ways of combining them into first search texts: combining the sample word segmentation texts of one granularity level with those of other granularity levels, and combining the sample word segmentation texts within each granularity level. Both yield a plurality of first search texts. This enriches the first search texts and enlarges the set of training samples for the basic text screening model, which benefits its training, so that the trained text screening model can more accurately identify the text to use for searching.
A text obtained by combining sample word segmentation texts across different granularity levels is determined as a first search text. For example, taking the first word segmentation texts of the first granularity level and the second word segmentation texts of the second granularity level, at least some of the first word segmentation texts may be mixed and combined with at least some of the second word segmentation texts to obtain a plurality of first search texts.
A text obtained by combining the sample word segmentation texts within each granularity level is also determined as a first search text. For example, taking the first word segmentation texts of the first granularity level, at least some of the first word segmentation texts may be combined with one another to obtain a plurality of first search texts.
In some embodiments, taking the first and second granularity levels as an example, after the first and second word segmentation texts are obtained, different first word segmentation texts and different second word segmentation texts may be combined in order according to a given order-combination rule; the first word segmentation texts may likewise be combined in order among themselves, and the second word segmentation texts among themselves, to obtain a plurality of first search texts.
In some embodiments, after the sample word segmentation texts corresponding to the at least two granularity levels are obtained, and before they are combined into a plurality of first search texts, a semantic order corresponding to the sample sentence may be determined. The semantic order may be the logical order of the sample sentence as input, and may also be the grammatical order of the sample sentence. When the semantic order corresponding to the sample sentence is determined, a sequence identifier of each word in the sample sentence may be determined.
Taking the first and second granularity levels as an example, after the first and second word segmentation texts are obtained, determining a text obtained by combining sample word segmentation texts across different granularity levels as a first search text includes:
(2.1) determining a text obtained by combining the first word segmentation texts and the second word segmentation texts according to the semantic order as a first search text.
It will be appreciated that a first search text obtained by combining sample word segmentation texts across granularity levels should itself be a text with a normal semantic order, which makes it easier for the search engine to search the database.
Combining the first and second word segmentation texts according to the semantic order yields a plurality of first search texts. Because these first search texts are generated according to the semantic order of the sample sentence, they are more consistent with the original meaning of the search sentence. They are therefore better suited to searching the database and can improve search accuracy in subsequent searches.
When the semantic order corresponding to the sample sentence is determined, the sequence identifier of each word in the sample sentence is determined. When the first and second word segmentation texts are combined according to the semantic order, they may be combined according to the sequence identifiers of the words they contain. For example, for the sample sentence "A Dijia certain area teppanyaki", one combined first search text is "A Di", "Jia certain", "area", "teppanyaki", whose semantic order is the same as that of the sample sentence.
Taking the first and second granularity levels as an example, after the first and second word segmentation texts are obtained, determining a text obtained by combining the sample word segmentation texts within each granularity level as a first search text includes:
(2.2) determining a text obtained by combining the first word segmentation texts according to the semantic order, and a text obtained by combining the second word segmentation texts according to the semantic order, as first search texts.
It will be appreciated that the first search texts obtained in this way are generated according to the semantic order of the sample sentence, that is, they are more consistent with the original meaning of the search sentence. They are therefore better suited to searching the database and can improve search accuracy in subsequent searches.
A text obtained by combining the first word segmentation texts according to the semantic order is determined as a first search text. For example, the sample sentence is "A Dijia certain area teppanyaki", and the first word segmentation texts are the five texts "A Di", "Jia certain", "area", "iron plate", and "roast". One combined first search text is "Jia certain", "area", "iron plate", "roast", whose semantic order is the same as that of the sample sentence.
A text obtained by combining the second word segmentation texts according to the semantic order is determined as a first search text. For example, the sample sentence is "A Dijia certain area teppanyaki", and the second word segmentation texts are the three texts "A Dijia certain", "area", and "teppanyaki". One combined first search text is "teppanyaki", whose semantic order is the same as that of the sample sentence.
Refer also to fig. 4, which is a schematic diagram of word segmentation combination according to an embodiment of the present application.
Here the sample sentence is "A Dijia certain area teppanyaki". The first word segmentation texts are the five texts "A Di", "Jia certain", "area", "iron plate", and "roast". The second word segmentation texts are the three texts "A Dijia certain", "area", and "teppanyaki".
After the semantic order of the sample sentence is determined, a directed acyclic graph corresponding to the first and second word segmentation texts can be constructed. As shown in fig. 4, the first and second word segmentation texts are combined starting from the input at the far left, and a first search text is obtained at the output at the far right.
The first word segmentation combination includes the first word segmentation texts "A Di", "Jia certain", and "area" and the second word segmentation text "teppanyaki". Combining these yields the first search text "A Di", "Jia certain", "area", "teppanyaki".
The second word segmentation combination includes the first word segmentation texts "area", "iron plate", and "roast" and the second word segmentation texts "A Dijia certain" and "area". Combining these yields the first search text "A Dijia certain", "area", "iron plate", "roast" (the segment "area" occurs in both segmentations and appears once in the result).
It will be appreciated that in the case where there are more first and second word-separated texts, more first search texts may be constructed, and the two first search texts constructed above are merely exemplary and not limiting to the present application.
For another example, from the first word segmentation texts alone, the first search text "A Di", "Jia certain", "area", "iron plate", "roast" may be generated directly in semantic order. Likewise, from the second word segmentation texts alone, the first search text "A Dijia certain", "area", "teppanyaki" may be generated directly in semantic order.
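The directed acyclic graph of fig. 4 can be sketched as a path enumeration. The following Python sketch rests on assumptions that are not in the original text: each word segmentation text is given hypothetical start and end positions (0 to 5) within the sample sentence as its sequence identifiers, and only paths that run from the start of the sentence to its end are emitted, so partial combinations are not generated here. The name `enumerate_paths` is hypothetical.

```python
from typing import Dict, List, Tuple

Segment = Tuple[int, int, str]  # (start position, end position, text)

# Hypothetical positions: the sentence is treated as 5 consecutive slots.
FIRST: List[Segment] = [
    (0, 1, "A Di"), (1, 2, "Jia certain"), (2, 3, "area"),
    (3, 4, "iron plate"), (4, 5, "roast"),
]
SECOND: List[Segment] = [
    (0, 2, "A Dijia certain"), (2, 3, "area"), (3, 5, "teppanyaki"),
]


def enumerate_paths(segments: List[Segment], length: int) -> List[List[str]]:
    """List every left-to-right path through the segment graph."""
    # Each position is a node; each segment is an edge start -> end.
    # Identical segments from different levels are merged by set().
    edges: Dict[int, List[Tuple[int, str]]] = {}
    for start, end, text in sorted(set(segments)):
        edges.setdefault(start, []).append((end, text))

    paths: List[List[str]] = []

    def walk(pos: int, acc: List[str]) -> None:
        if pos == length:          # reached the output on the right
            paths.append(list(acc))
            return
        for end, text in edges.get(pos, []):
            acc.append(text)
            walk(end, acc)
            acc.pop()

    walk(0, [])
    return paths


first_search_texts = enumerate_paths(FIRST + SECOND, 5)
```

Under these assumptions the sketch produces four first search texts: the two mixed combinations and the two single-level sequences described above. Because every edge moves strictly to the right, each path preserves the semantic order of the sample sentence.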
In some embodiments, taking the first and second granularity levels as an example, after the first and second word segmentation texts are obtained and before the sample word segmentation texts corresponding to the at least two granularity levels are combined into a plurality of first search texts, the method includes:
numbering each first word segmentation text to obtain a first numbering result; and numbering each second word segmentation text to obtain a second numbering result.
Each first word segmentation text is numbered to obtain the first numbering result. For example, the first word segmentation texts are the five texts "A Di", "Jia certain", "area", "iron plate", and "roast"; they may be numbered in sequence as A1, A2, A3, A4, and A5 to obtain the first numbering result.
Each second word segmentation text is numbered to obtain the second numbering result. For example, the second word segmentation texts are the three texts "A Dijia certain", "area", and "teppanyaki"; they may be numbered in sequence as B1, B2, and B3 to obtain the second numbering result.
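The numbering step can be sketched in a few lines of Python. The helper name `number_segments` is hypothetical, and the prefix-plus-index scheme simply mirrors the A1 to A5 and B1 to B3 numbers used in the example.

```python
from typing import Dict, List


def number_segments(prefix: str, segments: List[str]) -> Dict[str, str]:
    """Assign sequential numbers (prefix + 1-based index) to segments."""
    return {f"{prefix}{i}": seg for i, seg in enumerate(segments, start=1)}


first_numbered = number_segments(
    "A", ["A Di", "Jia certain", "area", "iron plate", "roast"])
second_numbered = number_segments(
    "B", ["A Dijia certain", "area", "teppanyaki"])
```

Here `first_numbered` plays the role of the first numbering result and `second_numbered` the role of the second numbering result.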
In one embodiment, taking the first and second granularity levels as an example, after the first and second word segmentation texts are obtained, determining a text obtained by combining sample word segmentation texts across different granularity levels as a first search text includes:
(3.1) combining the first word segmentation text and the second word segmentation text according to the first numbering result and the second numbering result to obtain a plurality of combined texts;
(3.2) determining an unambiguous combined text of the plurality of combined texts as the first search text.
The first and second word segmentation texts are combined according to the first and second numbering results to obtain a plurality of combined texts. The combined texts can cover all permutations and combinations of the first and second word segmentation texts, so that a wide variety of combined texts is obtained and none is omitted. The unambiguous combined texts among them are then determined as first search texts. This increases the number of first search texts while ensuring that they are unambiguous, which enlarges the set of training samples for the basic text screening model and benefits its training, so that the trained text screening model can more accurately identify the text to use for searching.
The first and second word segmentation texts are randomly combined according to the first and second numbering results to obtain a plurality of combined texts. For example, combining the numbers A1, A2, A3, and B3 in sequence yields the combined text "A Di", "Jia certain", "area", "teppanyaki". Combining the numbers B1, B2, A4, and A5 in sequence yields the combined text "A Dijia certain", "area", "iron plate", "roast". Combining the numbers B2, A4, and A5 in sequence yields the combined text "area", "iron plate", "roast".
Finally, the unambiguous combined texts among the plurality of combined texts are determined as first search texts. For example, among the combined texts above, "A Di", "Jia certain", "area", "teppanyaki" and "A Dijia certain", "area", "iron plate", "roast" are unambiguous, and both are determined as first search texts. The combined text "area", "iron plate", "roast" is unclear and ambiguous, so it cannot be determined as a first search text.
In one embodiment, taking the first and second granularity levels as an example, after the first and second word segmentation texts are obtained, determining a text obtained by combining the sample word segmentation texts within each granularity level as a first search text includes:
(3.3) combining the first word segmentation texts according to the first numbering result and combining the second word segmentation texts according to the second numbering result to obtain a plurality of combined texts;
(3.4) determining an unambiguous combined text of the plurality of combined texts as the first search text.
The first word segmentation texts are combined according to the first numbering result, and the second word segmentation texts are combined according to the second numbering result, to obtain a plurality of combined texts. These combined texts cover all text combinations within each granularity level, so that none is omitted. The unambiguous combined texts among them are then determined as first search texts. This increases the number of first search texts while ensuring that they are unambiguous, which enlarges the set of training samples for the basic text screening model and benefits its training, so that the trained text screening model can more accurately identify the text to use for searching.
For example, taking the first word segmentation texts, combining the numbers A1, A2, A3, A4, and A5 yields the combined text "A Di", "Jia certain", "area", "iron plate", "roast", and combining the numbers A2, A3, A4, and A5 yields the combined text "Jia certain", "area", "iron plate", "roast". Taking the second word segmentation texts, combining the numbers B1, B2, and B3 yields the combined text "A Dijia certain", "area", "teppanyaki", and combining the numbers B2 and B3 yields the combined text "area", "teppanyaki".
After the plurality of combined texts is obtained, the unambiguous combined texts among them may be determined as first search texts. For example, the combined text "area", "teppanyaki" is unclear and ambiguous, so it cannot be determined as a first search text. The combined text "A Dijia certain", "area", "teppanyaki" is unambiguous and can be used as a first search text.
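Number-based combination followed by ambiguity filtering can be sketched as below. The original text does not define how ambiguity is judged, so this Python sketch makes a loudly labeled assumption: a combined text counts as unambiguous only if its segments cover the whole sample sentence with no gap and no overlap. That rule agrees with the examples given for steps (3.1) to (3.4) but is not stated by the application. The position spans and the names `is_unambiguous` and `combine_by_number` are likewise hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start position, end position, text)

# Hypothetical positions over a sentence of 5 consecutive slots.
NUMBERED: Dict[str, Span] = {
    "A1": (0, 1, "A Di"), "A2": (1, 2, "Jia certain"), "A3": (2, 3, "area"),
    "A4": (3, 4, "iron plate"), "A5": (4, 5, "roast"),
    "B1": (0, 2, "A Dijia certain"), "B2": (2, 3, "area"),
    "B3": (3, 5, "teppanyaki"),
}


def is_unambiguous(spans: List[Span], length: int) -> bool:
    """ASSUMPTION: unambiguous means exact, gap-free sentence coverage."""
    pos = 0
    for start, end, _ in spans:
        if start != pos:
            return False
        pos = end
    return pos == length


def combine_by_number(numbered: Dict[str, Span],
                      length: int) -> Dict[Tuple[str, ...], List[str]]:
    """Try every combination of numbers and keep the unambiguous ones."""
    kept: Dict[Tuple[str, ...], List[str]] = {}
    for size in range(1, len(numbered) + 1):
        for combo in combinations(numbered, size):
            ordered = sorted(combo, key=lambda n: numbered[n][:2])
            spans = [numbered[n] for n in ordered]
            if is_unambiguous(spans, length):
                kept[tuple(ordered)] = [text for _, _, text in spans]
    return kept


# Cross-level combination, as in steps (3.1) and (3.2).
mixed = combine_by_number(NUMBERED, 5)
# Within-level combination, as in steps (3.3) and (3.4), first level only.
first_only = combine_by_number(
    {n: s for n, s in NUMBERED.items() if n.startswith("A")}, 5)
```

With this coverage rule, the combination B2, A4, A5 is rejected because it does not start at the beginning of the sentence, while A1, A2, A3, B3 and B1, B2, A4, A5 are kept. A different ambiguity test, for instance one based on a language model, could be substituted for `is_unambiguous` without changing the enumeration.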
In step 130, the database is searched according to each first search text, so as to obtain a search result corresponding to each first search text, and a relevance score between each first search text and the corresponding search result is determined.
After the first search texts are obtained, the database is searched with each first search text to obtain the search result corresponding to each first search text. Because the first search texts differ, the search results returned by the search engine for them also differ. A relevance score between each first search text and its corresponding search result therefore needs to be determined, in order to identify which first search texts are better suited to searching.
In some implementations, the relevance score between a first search text and its corresponding search result can be determined directly from the number of search results corresponding to that first search text. For example, if one first search text returns 80 search results and another returns 60, the relevance score of the first search text with 80 search results is higher than that of the first search text with 60 search results.
In some embodiments, the search result of a first search text contains a plurality of sub-search results, each corresponding to a correlation degree. For example, the search result corresponding to the first search text is a set of articles, and the correlation degree is divided into five levels (first, second, third, fourth, and fifth), where the first level is the lowest correlation degree and the fifth level is the highest. The search result has a search quantity, and each sub-search result has a sub-search quantity. For example, the search result is a set of articles and the search quantity is 160; the sub-search quantity of articles with the first-level correlation degree is 53, with the second-level correlation degree 72, with the third-level correlation degree 5, with the fourth-level correlation degree 30, and with the fifth-level correlation degree 0.
Determining a relevance score between the first search text and the corresponding search result includes:
(1.1) obtaining a weight value corresponding to each correlation degree;
(1.2) determining a target value corresponding to each correlation degree according to the weight value corresponding to each correlation degree and the corresponding sub-search quantity;
(1.3) determining a relevance score between the first search text and the corresponding search result according to the target value corresponding to each correlation degree and the search quantity.
It will be appreciated that some of the search results are relatively relevant to the first search text and some are not very relevant to the first search text. Therefore, the search results of the first search text are required to be divided according to different degrees of correlation, and then the correlation scoring is carried out on the first search text and the corresponding search results according to different degrees of correlation, so that the obtained correlation scoring is more accurate. Because the relevance score is needed to be used in the subsequent training of the basic text screening model, the better the training effect of the basic text screening model is under the condition of more accurate relevance score, the text screening model trained later can more accurately determine the search text needed by the search.
If the correlation degree is divided into five levels (first, second, third, fourth, and fifth), a weight value may be set for each correlation degree. The weight value of the first-level correlation degree is the lowest, the weight value of the second-level correlation degree is higher than that of the first level, and so on, with the weight value of the fifth-level correlation degree the highest. That is, the higher the correlation degree, the higher the corresponding weight value.
And then determining a target value corresponding to each correlation degree according to the weight value corresponding to each correlation degree and the corresponding sub-search quantity. For example, the corresponding target value under the correlation degree is obtained by multiplying the weight value corresponding to each correlation degree by the corresponding sub-search number. And if the first search text corresponds to a plurality of correlation degrees, acquiring a target value corresponding to each correlation degree.
For example, the sub-search quantity of articles with the first-level correlation degree is 53, and multiplying 53 by the weight value of the first-level correlation degree gives the target value of the first-level correlation degree. The sub-search quantity for the second-level correlation degree is 72, and multiplying 72 by the weight value of the second-level correlation degree gives the target value of the second-level correlation degree. The sub-search quantity for the third-level correlation degree is 5, and multiplying 5 by the weight value of the third-level correlation degree gives the target value of the third-level correlation degree. The sub-search quantity for the fourth-level correlation degree is 30, and multiplying 30 by the weight value of the fourth-level correlation degree gives the target value of the fourth-level correlation degree. The sub-search quantity for the fifth-level correlation degree is 0, and multiplying 0 by the weight value of the fifth-level correlation degree gives the target value of the fifth-level correlation degree.
And finally, determining a relevance score between the first search text and the corresponding search result according to the target value and the search quantity corresponding to each relevance degree. Specifically, the target values corresponding to each correlation degree may be added to obtain an addition result; and dividing the added result by the search quantity to determine a relevance score between the first search text and the corresponding search result.
For example, the addition result is obtained by adding the target values corresponding to the five correlation degrees of the first stage, the second stage, the third stage, the fourth stage, and the fifth stage, respectively. The added result is then divided by the number of searches 160 corresponding to the first search text, resulting in a relevance score between the first search text and the corresponding search result. In this way, a relevance score between each first search text and the corresponding search results may be determined in this manner.
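The calculation in steps (1.1) to (1.3) amounts to a weighted sum of the sub-search quantities divided by the search quantity. The Python sketch below uses the sub-search quantities from the example, but the weight values are hypothetical placeholders chosen only so that higher correlation degrees get higher weights; the application does not state the actual weights, so the resulting score is illustrative only. The function name `relevance_score` is also hypothetical.

```python
from typing import Dict


def relevance_score(sub_counts: Dict[int, int],
                    weights: Dict[int, float],
                    search_count: int) -> float:
    """Weighted sum of sub-search quantities divided by the search quantity."""
    if search_count == 0:
        return 0.0  # assumption: an empty search result scores zero
    # Target value per correlation degree = weight * sub-search quantity.
    target_values = {level: weights[level] * count
                     for level, count in sub_counts.items()}
    return sum(target_values.values()) / search_count


# Sub-search quantities per correlation degree (levels 1 to 5), from the example.
sub_counts = {1: 53, 2: 72, 3: 5, 4: 30, 5: 0}
# HYPOTHETICAL weights, increasing with the correlation degree.
weights = {1: 0.2, 2: 0.4, 3: 0.6, 4: 0.8, 5: 1.0}

score = relevance_score(sub_counts, weights, search_count=160)
```

With these placeholder weights the target values sum to 66.4, so the score is 66.4 / 160, about 0.415. Different weight values would give a different score for the same search result.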
With continued reference to table 1, there are a plurality of first search texts in table 1, and a search result corresponding to each search text, and a relevance score between each first search text and the corresponding search result.
TABLE 1
Take the first search text in table 1 whose relevance score is 0.72 as an example. The database is searched with this first search text to obtain the sub-search quantity for each correlation degree: 53 for the first-level correlation degree, 72 for the second level, 5 for the third level, 30 for the fourth level, and 0 for the fifth level. The search quantity of the search result corresponding to this first search text is 160. The sub-search quantity of each correlation degree is multiplied by the corresponding weight value to obtain the target value of that correlation degree. All target values are added to obtain an addition result, and the addition result is divided by the search quantity to determine the relevance score between the first search text and the corresponding search result, that is, the relevance score of 0.72 for this first search text.
Similarly, relevance scores corresponding to other first search texts can be determined in the mode.
It should be noted that the above is only one way of determining the relevance score between the first search text and the corresponding search result; the relevance score may also be determined by other calculation methods.
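The calculation described above can be sketched as follows. This is only an illustration: the function name `relevance_score` and the weight values are assumptions, since the text requires only that a higher correlation degree has a higher weight. These hypothetical weights are not the ones behind the 0.72 score in table 1, which the text does not give.

```python
from typing import Mapping


def relevance_score(sub_counts: Mapping[int, int], weights: Mapping[int, float]) -> float:
    """Weighted sum of the per-level sub-search counts divided by the number of searches."""
    total = sum(sub_counts.values())
    if total == 0:
        return 0.0  # assumption: a search with no results scores 0
    # Target value of each correlation degree = weight * number of sub-searches.
    added = sum(weights[level] * count for level, count in sub_counts.items())
    return added / total


# Sub-search counts from the example (levels 1 to 5, 160 results in total).
counts = {1: 53, 2: 72, 3: 5, 4: 30, 5: 0}
# Hypothetical weights, increasing with the correlation degree.
weights = {1: 0.0, 2: 0.25, 3: 0.5, 4: 0.75, 5: 1.0}

score = relevance_score(counts, weights)
print(f"relevance score with the hypothetical weights: {score:.5f}")
```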
In step 140, input features corresponding to each of the first search texts are obtained.
Referring to fig. 3, after obtaining the plurality of first search texts, the plurality of first search texts may be used as training samples of the basic text filtering model. However, to facilitate training of the basic text filtering model, it is also necessary to convert the first search text from a natural sentence into data that can be identified by the model, for example, encode the first search text, and then obtain a vector corresponding to the first search text, where the vector may be used as an input feature corresponding to the first search text.
In some embodiments, obtaining the input feature corresponding to the first search text includes:
(1.1) extracting sentence characteristics corresponding to the sample sentences;
(1.2) extracting word segmentation characteristics corresponding to the sample word segmentation text in the first search text;
(1.3) extracting statistical characteristics of sample word segmentation texts relative to sample sentences in the first search text;
(1.4) determining sentence characteristics, word segmentation characteristics and statistical characteristics as input characteristics corresponding to the first search text.
It can be understood that the input features corresponding to the first search text may be formed by combining a plurality of features, so that data corresponding to the input features is enriched, and thus, the basic text screening model can learn and train from more dimensions in the training process, and the trained text screening model can screen the text used for searching from a plurality of search texts more accurately.
Sentence features corresponding to the sample sentence are extracted. For example, the sample sentence is "A Dijia certain area teppanyaki". Each word in the sample sentence may be encoded to obtain a vector for each word, and the vectors are then added to obtain a final vector for the sample sentence, which is determined as the sentence feature corresponding to the sample sentence. For example, the five words "A ground", "A certain", "region", "iron plate", and "baked" may be encoded respectively to obtain a vector corresponding to each word, and the vectors are then added to obtain the sentence feature corresponding to the sample sentence.
Word segmentation features corresponding to the sample word segmentation texts in the first search text are extracted. Each sample word segmentation text in the first search text is encoded to obtain a vector for each sample word segmentation text, and the mean of these vectors is taken as the word segmentation feature corresponding to the sample word segmentation texts in the first search text. For example, if the first search text consists of "A certain" and "teppanyaki", then "A certain" and "teppanyaki" may be encoded respectively to obtain their vectors, and the mean of the two vectors is the word segmentation feature corresponding to the sample word segmentation texts in the first search text.
And extracting statistical characteristics of the sample word segmentation text relative to the sample sentence in the first search text. The first number of words of the sample word segmentation text in the first search text can be determined, the second number of words in the sample sentence is determined, and then the first number and the second number are used as statistical features.
After the sentence feature, the word segmentation feature and the statistical feature are obtained, all three can be determined to be input features corresponding to the first search text.
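A minimal sketch of the three features follows. The `encode` function is a hypothetical stand-in for whatever encoder is actually used (the text does not name one), and the dimension `DIM` and the English placeholder tokens are likewise assumptions made for illustration.

```python
import hashlib
from typing import Dict, List

DIM = 8  # hypothetical vector size


def encode(token: str) -> List[float]:
    """Hypothetical encoder: maps a token to a deterministic vector with values in [0, 1)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [byte / 256.0 for byte in digest[:DIM]]


def sentence_feature(sentence_words: List[str]) -> List[float]:
    """Element-wise sum of the vectors of all words in the sample sentence."""
    vectors = [encode(word) for word in sentence_words]
    return [sum(column) for column in zip(*vectors)]


def segmentation_feature(search_text: List[str]) -> List[float]:
    """Element-wise mean of the vectors of the word segmentation texts in the search text."""
    vectors = [encode(token) for token in search_text]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def statistical_feature(search_text: List[str], sentence_words: List[str]) -> List[float]:
    """First number (tokens in the search text) and second number (words in the sentence)."""
    return [float(len(search_text)), float(len(sentence_words))]


def input_features(search_text: List[str], sentence_words: List[str]) -> Dict[str, List[float]]:
    return {
        "sentence": sentence_feature(sentence_words),
        "segmentation": segmentation_feature(search_text),
        "statistical": statistical_feature(search_text, sentence_words),
    }


sentence_words = ["A ground", "A certain", "region", "iron plate", "baked"]
search_text = ["A certain", "teppanyaki"]
features = input_features(search_text, sentence_words)
print(features["statistical"])
```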
In step 150, training the basic text screening model according to the corresponding input features and the corresponding relevance scores of each first search text to obtain a text screening model.
It can be appreciated that the input features corresponding to the first search text can be input as training samples, and the relevance scores corresponding to the first search text can be used as labels, so that training of the basic text screening model is achieved.
Referring to fig. 5, fig. 5 is another data flow diagram of text filtering model training according to an embodiment of the application.
After the input features corresponding to each first search text are obtained, they may be input to the basic text screening model for training. The basic text screening model outputs a corresponding output value, which is the score predicted by the model. The output value is compared with the relevance score serving as the label to obtain a difference, and the basic text screening model is trained by back propagation according to the difference. This is iterated to continuously optimize the basic text screening model until an iteration condition is met, that is, until the difference converges or the number of iterations reaches a set number, for example 1000 training iterations. Training of the basic text screening model then ends, and the text screening model is obtained.
In some embodiments, training the basic text screening model according to the input features and the relevance scores corresponding to the first search text to obtain a text screening model, including:
(1.1) inputting the input features corresponding to the first search text into a basic text screening model to obtain an output value;
(1.2) determining a loss value between the output value and the relevance score according to a preset loss function;
(1.3) if the loss value meets the preset loss condition, completing training of the basic text screening model to obtain the text screening model.
It will be appreciated that using the input features as a sample and the relevance scores as a label, the effectiveness of training the underlying text-screening model may be measured to determine whether the underlying text-screening model is trained.
The input features corresponding to the first search text are input into a basic text screening model, and the basic text screening model can process the input features so as to obtain an output value. In the application, a mean square error loss function can be adopted to determine the loss value between the corresponding output value of the first search text and the corresponding relevance score.
Wherein, the mean square error loss function formula is as follows:
Wherein F is the mapping corresponding to the basic text screening model, x is an input feature, y is a relevance score, and σ is the Sigmoid function.
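The formula itself is not reproduced in this text. One form consistent with the definitions above is given below, where σ denotes the Sigmoid function. It assumes the squared error is averaged over N first search texts; N and the sample index i are notation introduced here and do not come from the source.

```latex
L = \frac{1}{N} \sum_{i=1}^{N} \left( \sigma\!\left(F(x_i)\right) - y_i \right)^2,
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}}
```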
After calculating the loss value between the output value corresponding to the first search text and the corresponding correlation score through the preset loss function, if the loss value is within the preset loss value interval, the output result of the basic text screening model is considered to be accurate. If the loss value is not within the preset loss value interval, the output result of the basic text screening model is considered to be inaccurate, and the basic text screening model needs to be trained continuously. For example, if the preset loss value interval is 0-0.2 and the loss value is 0.3, the output result of the basic text screening model is considered to be inaccurate, and at this time, the basic text screening model needs to be continuously trained.
In this way, the basic text screening model can be trained according to the input features and the relevance score corresponding to each first search text until the loss value corresponding to the output value of the basic text screening model meets the preset loss condition, at which point training is complete and the text screening model is obtained.
Referring to fig. 3, after the input features corresponding to each first search text are input to the basic text filtering model, the basic text filtering model may output corresponding output results, for example, the first search text 1 corresponds to the output result 1, the first search text 2 corresponds to the output result 2, and the first search text N corresponds to the output result N. And when the loss value of each output result meets the preset loss condition, training the basic text screening model to obtain the text screening model.
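The training procedure described above can be sketched as follows. This is not the model of the application: a single linear layer followed by a Sigmoid is assumed as a stand-in for the basic text screening model, whose architecture the text does not specify. The names `predict`, `mse`, and `train`, the learning rate, and the toy samples are all hypothetical. The default threshold of 0.2 mirrors the example loss interval of 0-0.2, and the iteration cap of 1000 mirrors the example iteration count.

```python
import math
from typing import List, Tuple

Sample = Tuple[List[float], float]  # (input feature, relevance score used as label)


def predict(w: List[float], b: float, x: List[float]) -> float:
    """Sigmoid of a linear mapping, so the output value lies between 0 and 1."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def mse(w: List[float], b: float, samples: List[Sample]) -> float:
    """Mean square error between output values and relevance scores."""
    return sum((predict(w, b, x) - y) ** 2 for x, y in samples) / len(samples)


def train(samples: List[Sample], lr: float = 1.0, max_iters: int = 1000,
          loss_threshold: float = 0.2) -> Tuple[List[float], float, float, int]:
    """Gradient descent until the loss meets the threshold or max_iters is reached."""
    dim = len(samples[0][0])
    w, b = [0.0] * dim, 0.0
    n = len(samples)
    for iteration in range(max_iters):
        loss = mse(w, b, samples)
        if loss <= loss_threshold:  # preset loss condition met: stop training
            return w, b, loss, iteration
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in samples:
            out = predict(w, b, x)
            # Derivative of (out - y)^2 with respect to the pre-Sigmoid value.
            g = 2.0 * (out - y) * out * (1.0 - out) / n
            for j, xj in enumerate(x):
                grad_w[j] += g * xj
            grad_b += g
        w = [wi - lr * gw for wi, gw in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b, mse(w, b, samples), max_iters


# Toy samples with made-up features and labels, for illustration only.
toy_samples = [([1.0, 0.0], 0.9), ([0.0, 1.0], 0.1)]
w, b, final_loss, iterations = train(toy_samples, loss_threshold=1e-3)
print(f"stopped after {iterations} iterations")
```

A tighter threshold than 0.2 is passed in the usage lines because, with these toy labels, the untrained model already falls inside the 0-0.2 interval.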
From the above, in the embodiment of the present application, a sample sentence is obtained, and word segmentation processing is performed on the sample sentence according to at least two granularity levels to obtain sample word segmentation texts corresponding to the at least two granularity levels respectively; the sample word segmentation texts corresponding to the at least two granularity levels are combined to obtain a plurality of first search texts; the database is searched according to each first search text to obtain a search result corresponding to each first search text, and a relevance score between each first search text and the corresponding search result is determined; input features corresponding to each first search text are acquired; and the basic text screening model is trained according to the input features and the relevance score corresponding to each first search text to obtain a text screening model. In this way, the sample sentence is segmented at different granularity levels to obtain word segmentation texts at a plurality of granularity levels, and the sample word segmentation texts of the at least two granularity levels are combined into a plurality of first search texts. Each first search text is used to search the database to obtain its search result, and the relevance score between each first search text and the corresponding search result is determined. With the relevance scores as labels and the input features of each first search text as samples, the basic text screening model is trained to obtain the text screening model. The text screening model can identify the search text most suitable for searching, and the database is then searched by using that search text to obtain the search results. Compared with the search results obtained by searching only by utilizing keywords in the related art, the search results obtained by searching the target search text screened by the text screening model are more accurate.
Fig. 6 is another flow chart of a text filtering model training method according to an embodiment of the application. The text filtering model training method can comprise the following steps:
In step 201, a sample sentence is obtained, and word segmentation processing is performed on the sample sentence according to a first granularity level, so as to obtain a first word segmentation text corresponding to the first granularity level.
For example, when the sample sentence is "A Dijia certain area teppanyaki", segmenting the sample sentence at the first granularity level yields five first word segmentation texts, namely "A ground", "A certain", "region", "iron plate", and "baked".
In step 202, word segmentation processing is performed on the sample sentence according to the second granularity level, so as to obtain a second word segmentation text corresponding to the second granularity level.
For example, when the sample sentence is "A Dijia certain area teppanyaki", segmenting the sample sentence at the second granularity level yields three second word segmentation texts, namely "A Dijia some", "district", and "teppanyaki".
The first granularity level is higher than the second granularity level, namely the first granularity is fine granularity, and the second granularity is coarse granularity.
In step 203, a semantic order corresponding to the sample sentence is determined, and a text obtained by combining the first word segmentation text and the second word segmentation text according to the semantic order is determined as a first search text.
The semantic order corresponding to the sample sentence can be determined. The semantic order may be the logical order of the sample sentence as it was input, or the order given by the grammar of the sample sentence. When determining the semantic order corresponding to the sample sentence, an order identifier of each word in the sample sentence can be determined.
It will be appreciated that a first search text obtained by combining the sample word segmentation texts of different granularity levels should itself be a text with a normal semantic order, which makes it easier for the search engine to search the database.
And combining the first word segmentation text and the second word segmentation text through semantic sequences to obtain a plurality of first search texts. The plurality of first search texts are generated according to the semantic order of the sample sentence, that is, the plurality of first search texts are more consistent with the original meaning of the search sentence. Therefore, the plurality of first search texts are more suitable for searching the database, and the accuracy of searching can be improved in the subsequent searching process.
When the first word segmentation text and the second word segmentation text are combined according to the semantic order, the combination can be performed according to the sequence identification of each word in the first word segmentation text and the second word segmentation text, for example, a sample sentence is "A Dijia certain area teppanyaki". Wherein, one first search text of the combination is "A land", "A certain", "region", "Teppanyaki", and the semantic order of the first search text is the same as the semantic order of the sample sentence.
In step 204, a text obtained by combining the first word-divided text according to the semantic order and a text obtained by combining the second word-divided text according to the semantic order are determined as the first search text.
The text obtained by combining the first word segmentation texts according to the semantic order is determined as a first search text. For example, the sample sentence is "A Dijia certain area teppanyaki". The first word segmentation text comprises five word segmentation texts: "A ground", "A certain", "region", "iron plate", and "baked". One combined first search text is "A certain", "region", "iron plate", "baked", and the semantic order of this first search text is the same as the semantic order of the sample sentence.
The text obtained by combining the second word segmentation texts according to the semantic order is determined as a first search text. For example, the sample sentence is "A Dijia certain area teppanyaki". The second word segmentation text comprises three word segmentation texts: "A Dijia some", "district", and "teppanyaki". One combined first search text is "teppanyaki", and the semantic order of this first search text is the same as the semantic order of the sample sentence.
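Steps 203 and 204 can be illustrated with the sketch below. It makes several assumptions that are not in the source: each word segmentation text carries a half-open span (start, end) that serves as its order identifier, the spans assigned to the placeholder tokens are hypothetical, and a combination may not contain two tokens whose spans overlap. The function name `combine_in_order` is likewise invented for illustration.

```python
from itertools import combinations
from typing import List, Tuple

Token = Tuple[int, int, str]  # (start, end, text): half-open span in the sample sentence


def overlaps(a: Token, b: Token) -> bool:
    """True if the two tokens cover a common part of the sample sentence."""
    return a[0] < b[1] and b[0] < a[1]


def combine_in_order(*levels: List[Token]) -> List[Tuple[str, ...]]:
    """All non-empty, non-overlapping token combinations, kept in sentence order."""
    pool = sorted({token for level in levels for token in level})
    results = []
    for size in range(1, len(pool) + 1):
        # combinations() preserves the sorted order, so tokens stay in semantic order.
        for combo in combinations(pool, size):
            if any(overlaps(a, b) for a, b in combinations(combo, 2)):
                continue
            results.append(tuple(token[2] for token in combo))
    return results


# Hypothetical spans for the first (fine) and second (coarse) word segmentation texts.
first_level = [(0, 1, "A ground"), (1, 2, "A certain"), (2, 3, "region"),
               (3, 4, "iron plate"), (4, 5, "baked")]
second_level = [(0, 2, "A Dijia some"), (2, 3, "district"), (3, 5, "teppanyaki")]

first_search_texts = combine_in_order(first_level, second_level)
print(len(first_search_texts), "candidate first search texts")
```

Because subsets drawn from a single level are included, the result covers both the mixed combinations of step 203 and the single-level combinations of step 204.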
In step 205, a weight value corresponding to each degree of correlation is obtained.
In some embodiments, for a search result of a first search text, there are a plurality of sub-search results corresponding to respective degrees of relevance. For example, the search results corresponding to the first search text are a plurality of articles, wherein the relevance is divided into five relevance levels of a first level, a second level, a third level, a fourth level and a fifth level, the first level is the lowest relevance level, and the fifth level is the highest relevance level. The search results include a number of searches and the sub-search results include a number of sub-searches. For example, the search results are articles, the number of searches is 160, the number of sub-searches of the articles corresponding to the first-level correlation degree is 53, the number of sub-searches of the articles corresponding to the second-level correlation degree is 72, the number of sub-searches of the articles corresponding to the third-level correlation degree is 5, the number of sub-searches of the articles corresponding to the fourth-level correlation degree is 30, and the number of sub-searches of the articles corresponding to the fifth-level correlation degree is 0.
If the correlation degree is divided into five correlation degrees of one stage, two stages, three stages, four stages and five stages, a corresponding weight value can be set for each correlation degree, wherein the weight value corresponding to the first stage correlation degree is the lowest, then the weight value corresponding to the second stage correlation degree is higher than the weight value corresponding to the first stage correlation degree, and the weight value corresponding to the five stages of correlation degree is the highest. That is, the higher the degree of correlation, the higher the weight value corresponding to the degree of correlation.
In step 206, a target value corresponding to each correlation degree is determined according to the weight value corresponding to each correlation degree and the corresponding number of sub-searches.
For example, the corresponding target value under the correlation degree is obtained by multiplying the weight value corresponding to each correlation degree by the corresponding sub-search number. And if the first search text corresponds to a plurality of correlation degrees, acquiring a target value corresponding to each correlation degree.
For example, the number of sub-searches of the articles corresponding to the first-level correlation degree is 53, and the target value corresponding to the first-level correlation degree is obtained by multiplying 53 by the weight value corresponding to the first-level correlation degree. The number of sub-searches of the articles corresponding to the second-level correlation degree is 72, and the target value corresponding to the second-level correlation degree is obtained by multiplying 72 by the weight value corresponding to the second-level correlation degree. The number of sub-searches of the articles corresponding to the third-level correlation degree is 5, and the target value corresponding to the third-level correlation degree is obtained by multiplying 5 by the weight value corresponding to the third-level correlation degree. The number of sub-searches of the articles corresponding to the fourth-level correlation degree is 30, and the target value corresponding to the fourth-level correlation degree is obtained by multiplying 30 by the weight value corresponding to the fourth-level correlation degree. The number of sub-searches of the articles corresponding to the fifth-level correlation degree is 0, and the target value corresponding to the fifth-level correlation degree is obtained by multiplying 0 by the weight value corresponding to the fifth-level correlation degree.
In step 207, a relevance score between the first search text and the corresponding search results is determined based on the target value and the number of searches for each relevance.
Specifically, the target values corresponding to each correlation degree may be added to obtain an addition result; and dividing the added result by the search quantity to determine a relevance score between the first search text and the corresponding search result.
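Steps 205 to 207 can be written compactly as a single expression. The symbols are notation introduced here for illustration and do not appear in the source: K is the number of correlation degrees, w_k is the weight value of the k-th correlation degree, n_k is its number of sub-searches, N is the number of searches, and s is the resulting relevance score.

```latex
s = \frac{1}{N} \sum_{k=1}^{K} w_k \, n_k
```

In the example above, K = 5, the sub-search numbers are 53, 72, 5, 30, and 0, and N = 160.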
In step 208, sentence features corresponding to the sample sentence are extracted, word segmentation features corresponding to the sample word segmentation text in the first search text are extracted, and statistical features of the sample word segmentation text in the first search text relative to the sample sentence are extracted.
Sentence features corresponding to the sample sentence are extracted. For example, the sample sentence is "A Dijia certain area teppanyaki". Each word in the sample sentence may be encoded to obtain a vector for each word, and the vectors are then added to obtain a final vector for the sample sentence, which is determined as the sentence feature corresponding to the sample sentence. For example, the five words "A ground", "A certain", "region", "iron plate", and "baked" may be encoded respectively to obtain a vector corresponding to each word, and the vectors are then added to obtain the sentence feature corresponding to the sample sentence.
Word segmentation features corresponding to the sample word segmentation texts in the first search text are extracted. Each sample word segmentation text in the first search text is encoded to obtain a vector for each sample word segmentation text, and the mean of these vectors is taken as the word segmentation feature corresponding to the sample word segmentation texts in the first search text. For example, if the first search text consists of "A certain" and "teppanyaki", then "A certain" and "teppanyaki" may be encoded respectively to obtain their vectors, and the mean of the two vectors is the word segmentation feature corresponding to the sample word segmentation texts in the first search text.
And extracting statistical characteristics of the sample word segmentation text relative to the sample sentence in the first search text. The first number of words of the sample word segmentation text in the first search text can be determined, the second number of words in the sample sentence is determined, and then the first number and the second number are used as statistical features.
In step 209, sentence features, word segmentation features, and statistical features are determined as input features corresponding to the first search text.
After the sentence feature, the word segmentation feature and the statistical feature are obtained, all three can be determined to be input features corresponding to the first search text.
For another example, the sentence feature, the word segmentation feature, and the statistical feature may be spliced to form a multi-dimensional vector, and the multi-dimensional vector is determined as the input feature.
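The splicing can be illustrated as a simple concatenation. The function name `splice_features` and the three-dimensional example vectors are hypothetical; real feature vectors would come from the encoder in use.

```python
from typing import List, Sequence


def splice_features(sentence_feat: Sequence[float],
                    segmentation_feat: Sequence[float],
                    statistical_feat: Sequence[float]) -> List[float]:
    """Concatenate the three features into one multi-dimensional input vector."""
    return [*sentence_feat, *segmentation_feat, *statistical_feat]


# Made-up values for illustration only.
sentence_feat = [0.4, 1.2, 0.7]
segmentation_feat = [0.1, 0.5, 0.3]
statistical_feat = [2.0, 5.0]  # first number and second number

input_vector = splice_features(sentence_feat, segmentation_feat, statistical_feat)
print(input_vector)
```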
In step 210, input features corresponding to the first search text are input into the basic text filtering model, and an output value is obtained.
It can be appreciated that the input features corresponding to the first search text can be input as training samples, and the relevance scores corresponding to the first search text can be used as labels, so that training of the basic text screening model is achieved.
And inputting the input features corresponding to the first search text into a basic text screening model, wherein the basic text screening model can process the input features so as to obtain an output value.
In step 211, a loss value between the output value and the relevance score is determined according to a preset loss function.
In the application, a mean square error loss function can be adopted to determine the loss value between the corresponding output value of the first search text and the corresponding relevance score.
Wherein, the mean square error loss function formula is as follows:
Wherein F is the mapping corresponding to the basic text screening model, x is an input feature, y is a relevance score, and σ is the Sigmoid function.
After calculating the loss value between the output value corresponding to the first search text and the corresponding correlation score through the preset loss function, if the loss value is within the preset loss value interval, the output result of the basic text screening model is considered to be accurate. If the loss value is not within the preset loss value interval, the output result of the basic text screening model is considered to be inaccurate, and the basic text screening model needs to be trained continuously.
In step 212, if the loss value meets the preset loss condition, training the basic text screening model is completed, and a text screening model is obtained.
For example, if the preset loss value interval is 0-0.2 and the loss value is 0.3, the output result of the basic text screening model is considered to be inaccurate, and at this time, the basic text screening model needs to be continuously trained.
And training the basic text screening model according to the input characteristics and the relevance scores corresponding to each first search text until the loss value corresponding to the output value of the basic text screening model meets the preset loss condition, and completing training the basic text screening model to obtain the text screening model.
In the embodiment of the application, the sample sentence is subjected to word segmentation processing through the first granularity level, so as to obtain a first word segmentation text. And performing word segmentation processing on the sample sentence through the second granularity level to obtain a second word segmentation text. And then, determining the text obtained by combining the first word segmentation text and the second word segmentation text as a first search text through the semantic sequence of the sample sentence. The text obtained by combining the first word segmentation texts according to the semantic sequence and the text obtained by combining the second word segmentation texts according to the semantic sequence are determined to be the first search text. And finally, according to the input characteristics of each first search text in the plurality of first search texts, inputting the input characteristics into the basic text screening model for training, thereby obtaining the text screening model. Compared with the search results obtained by searching only by utilizing keywords in the related art, the search results obtained by searching the target search text screened by the text screening model are more accurate.
After the text filtering model is obtained, the text filtering model can be directly applied in the search scene.
Referring to fig. 7, fig. 7 is a flowchart illustrating a text searching method according to an embodiment of the present application. The text search method may include the steps of:
In step 310, a search sentence is obtained, and word segmentation processing is performed on the search sentence according to at least two granularity levels, so as to obtain word segmentation texts corresponding to the at least two granularity levels respectively.
For example, the object may input a search term in a search box of a search engine through the terminal device, thereby obtaining the search term.
Referring to fig. 8 together, fig. 8 is a schematic diagram of a data flow of text searching according to an embodiment of the present application.
As shown in fig. 8, the segmenter 1 may set a first granularity level, the segmenter 2 sets a second granularity level, and then the search sentence is segmented using the first granularity level, so as to obtain a third segmented text. And performing word segmentation processing on the search sentence by adopting the second granularity level, so as to obtain a fourth word segmentation text.
In step 320, the word segmentation texts corresponding to the at least two granularity levels are combined to obtain a plurality of second search texts.
In some embodiments, a text obtained by combining word segmentation texts corresponding to different granularity levels may be determined as a second search text; a text obtained by combining the word segmentation texts corresponding to a single granularity level may also be determined as a second search text.
For example, taking the third word segmentation text with the first granularity level and the fourth word segmentation text with the second granularity level as examples, at least part of the third word segmentation text and at least part of the fourth word segmentation text may be mixed and combined, so as to obtain a plurality of second search texts.
For example, taking the third segmented text with the first granularity level as an example, at least part of the segmented texts in the plurality of third segmented texts may be combined, so as to obtain a plurality of second search texts.
In step 330, each second search text is input into the text filtering model trained by the text filtering model training method, and the target search text is filtered out.
After obtaining the plurality of second search texts, each second search text may be input into a text filtering model, and the text filtering model may output an output value corresponding to each second search text. And determining the second search text with the highest output value from the plurality of second search texts as a target search text.
As shown in fig. 8, the text filtering model may receive one or more second search texts at a time and determine an output value corresponding to each second search text, where the output value is in the range of 0-1. The second search text with the highest output value may be determined as the target search text.
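The screening step amounts to picking the candidate with the highest output value, as in the sketch below. The trained text screening model is represented by an arbitrary scoring callable; the function name `select_target` and the scores in `made_up_scores` are invented for illustration and are not results from the application.

```python
from typing import Callable, Sequence, Tuple

SearchText = Tuple[str, ...]


def select_target(candidates: Sequence[SearchText],
                  score_fn: Callable[[SearchText], float]) -> Tuple[SearchText, float]:
    """Return the second search text with the highest output value, and that value."""
    if not candidates:
        raise ValueError("at least one second search text is required")
    best = max(candidates, key=score_fn)
    return best, score_fn(best)


# Made-up output values standing in for the text screening model.
made_up_scores = {
    ("A certain", "teppanyaki"): 0.81,
    ("A ground", "A certain", "region", "iron plate", "baked"): 0.42,
    ("teppanyaki",): 0.37,
}

target, target_score = select_target(list(made_up_scores), made_up_scores.__getitem__)
print(target, target_score)
```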
In step 340, the database is searched according to the target search text to obtain the target search result.
After the target search text is obtained, the database can be searched by directly utilizing the target search text, so that a search result corresponding to the target search text is searched.
For example, there are multiple articles in the database, and the multiple articles can be matched and searched through the target search text, so that target articles related to the target search text are searched, and the target articles are search results.
From the above, in the embodiment of the present application, the word segmentation processing is performed on the search statement according to at least two granularity levels by obtaining the search statement, so as to obtain word segmentation texts corresponding to at least two granularity levels respectively; combining word segmentation texts corresponding to at least two granularity levels respectively to obtain a plurality of second search texts; inputting each second search text into a text screening model trained by the text screening model training method provided by the embodiment of the application, and screening out target search text; searching the database according to the target search text to obtain a target search result. In this way, in the embodiment of the application, word segmentation processing is performed on the search sentence according to at least two granularity levels to obtain word segmentation texts corresponding to at least two granularity levels respectively, then the word segmentation texts corresponding to at least two granularity levels respectively are combined to obtain a plurality of second search texts, and then a text screening model is used for screening out target search texts from the plurality of second search texts. Compared with the search results obtained by searching only by utilizing keywords in the related art, the search results obtained by searching the target search text screened by the text screening model are more accurate.
For an overall understanding of the text filtering model in the present application, please continue to refer to fig. 9, fig. 9 is another schematic view of a text search scenario provided in an embodiment of the present application.
As shown in fig. 9, the model goes through a training stage and an application stage. In the training stage, a basic text screening model is trained to obtain the text screening model, which is then used in the application stage.
In the training stage, after a sample sentence is input, word segmentation device 1 and word segmentation device 2 each perform word segmentation processing on the sample sentence, and the granularity levels corresponding to the two devices are different: word segmentation device 1 processes the sample sentence to obtain first word segmentation texts, and word segmentation device 2 processes the sample sentence to obtain second word segmentation texts. The first word segmentation texts and the second word segmentation texts are then combined to obtain a plurality of first search texts. The database is searched using each first search text to obtain a search result corresponding to each first search text, and a relevance score between each first search text and the corresponding search result is then determined. Finally, the input features corresponding to each first search text are acquired, and the basic text screening model is trained according to the input features corresponding to each first search text and the corresponding relevance scores to obtain the text screening model.
The first search text may be used as a sample, and the relevance score corresponding to each first search text may be used as a label.
Specifically, after the input features corresponding to each first search text are obtained, they may be input into the basic text screening model for training. The basic text screening model outputs a corresponding output value, which is the score predicted by the model. The output value is then compared with the relevance score serving as the label to obtain a difference, and the basic text screening model is trained by back propagation according to the difference. This is iterated to continuously optimize the basic text screening model until an iteration condition is met, that is, until the difference converges or the number of iterations reaches a certain number (for example, 1000 training iterations), at which point training ends and the text screening model is obtained. The trained text screening model can screen out, from a plurality of search texts, the text most suitable for searching the database, thereby improving the accuracy of the search results obtained by subsequent database searches.
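The training procedure above can be sketched as follows. This is a minimal illustration under stated assumptions: a linear scorer trained by plain gradient descent on mean squared error stands in for the basic text screening model, whose architecture is not specified here, and the names `train_screening_model`, `predict`, `lr` and `tol` are hypothetical. The stopping rule mirrors the text: training ends when the loss stops changing (convergence) or after 1000 iterations.

```python
from typing import List, Tuple


def predict(weights: List[float], bias: float, features: List[float]) -> float:
    """Score one search text from its input features."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def train_screening_model(
    samples: List[Tuple[List[float], float]],
    lr: float = 0.05,
    max_iters: int = 1000,
    tol: float = 1e-6,
) -> Tuple[List[float], float]:
    """Fit a linear scorer to (input features, relevance score) pairs."""
    n = len(samples)
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    prev_loss = float("inf")
    for _ in range(max_iters):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        loss = 0.0
        for features, label in samples:
            diff = predict(weights, bias, features) - label  # output vs. label
            loss += diff * diff
            for j, x in enumerate(features):
                grad_w[j] += 2 * diff * x
            grad_b += 2 * diff
        loss /= n
        if abs(prev_loss - loss) < tol:  # the difference has converged
            break
        prev_loss = loss
        # Gradient step: propagate the difference back to the parameters.
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Toy samples: (input features of a first search text, relevance score label).
samples = [([1.0, 0.0], 0.9), ([0.0, 1.0], 0.2), ([1.0, 1.0], 0.6)]
weights, bias = train_screening_model(samples)
```

After training, `predict` ranks the toy search texts in the same order as their relevance score labels.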
In the application stage, after a search sentence is input, word segmentation device 1 and word segmentation device 2 each perform word segmentation processing on the search sentence: word segmentation device 1 processes the search sentence to obtain third word segmentation texts, and word segmentation device 2 processes the search sentence to obtain fourth word segmentation texts. The third word segmentation texts and the fourth word segmentation texts are then combined to obtain a plurality of second search texts. The second search texts are input into the text screening model, which screens out a target search text from among the plurality of second search texts. Finally, the database is searched according to the target search text to obtain a target search result.
According to the method and the device, the words of the search sentences are segmented through different granularity levels, and then the segmented words of different granularity levels are combined, so that the richness of the second search texts is increased, the target search texts which are most suitable for searching are screened out from a plurality of second search texts through the trained text screening model, more accurate search results can be searched out from the database through the target search texts in the searching process, and the searching accuracy is improved.
Referring to fig. 10, fig. 10 is a schematic structural diagram of a text screening model training device according to an embodiment of the application. The terms have the same meanings as in the text screening model training method described above, and for specific implementation details, refer to the description in the method embodiments.
As shown in fig. 10, the text screening model training device 400 includes: a first word segmentation module 410, a first combination module 420, a scoring module 430, an acquisition module 440, and a training module 450.
The first word segmentation module 410 is configured to obtain a sample sentence, and perform word segmentation processing on the sample sentence according to at least two granularity levels, so as to obtain a sample word segmentation text corresponding to at least two granularity levels respectively;
The first combination module 420 is configured to combine the sample word segmentation texts corresponding to at least two granularity levels respectively to obtain a plurality of first search texts;
the scoring module 430 is configured to search the database according to each first search text, obtain a search result corresponding to each first search text, and determine a relevance score between each first search text and the corresponding search result;
an obtaining module 440, configured to obtain an input feature corresponding to each first search text;
the training module 450 is configured to train the basic text filtering model according to the input feature corresponding to each first search text and the corresponding relevance score, so as to obtain a text filtering model.
In some embodiments, the first combining module 420 is configured to:
determining the text obtained by combining the sample word segmentation texts corresponding to different granularity levels as a first search text;
and determining the text obtained by combining the sample word segmentation texts corresponding to each granularity level as a first search text.
In some implementations, the first word segmentation module 410 is configured to:
performing word segmentation processing on the sample sentence according to the first granularity level to obtain a first word segmentation text corresponding to the first granularity level;
And performing word segmentation processing on the sample sentence according to the second granularity level to obtain a second word segmentation text corresponding to the second granularity level.
In some embodiments, the first combining module 420 is configured to: before combining the sample word segmentation texts respectively corresponding to the at least two granularity levels to obtain the plurality of first search texts, determine the semantic sequence corresponding to the sample sentence.
In some embodiments, the first combining module 420 is configured to:
the text obtained by combining the first word segmentation text and the second word segmentation text according to the semantic sequence is determined to be a first search text;
and determining the text obtained by combining the first word segmentation texts according to the semantic sequence and the text obtained by combining the second word segmentation texts according to the semantic sequence as the first search text.
In some embodiments, the first combining module 420 is configured to: before combining the sample word segmentation texts corresponding to at least two granularity levels respectively to obtain a plurality of first search texts, numbering each first word segmentation text to obtain a first numbering result;
numbering each second word segmentation text to obtain a second numbering result;
In some embodiments, the first combining module 420 is configured to:
combining the first word segmentation text and the second word segmentation text according to the first numbering result and the second numbering result to obtain a plurality of combined texts;
determining an unambiguous combined text of the plurality of combined texts as a first search text;
combining the first word segmentation texts according to the first numbering result and combining the second word segmentation texts according to the second numbering result to obtain a plurality of combined texts;
an unambiguous combined text of the plurality of combined text is determined as a first search text.
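The numbering-based combination can be sketched as below. The sketch rests on assumptions that the application does not spell out: each word segmentation text is numbered by its character span in the sample sentence, both segmentations concatenate back to the same sentence, and a combined text counts as unambiguous when its segments cover the sentence exactly once, in order, with no gap or overlap. The names `number_tokens` and `combine` are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, segment text)


def number_tokens(tokens: List[str]) -> List[Span]:
    """Number each segment by its character span in the sentence."""
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok), tok))
        pos += len(tok)
    return spans


def combine(first: List[str], second: List[str]) -> List[List[str]]:
    """Enumerate combined texts whose segments tile the sentence exactly once."""
    pool = sorted(set(number_tokens(first)) | set(number_tokens(second)))
    total = sum(len(tok) for tok in first)
    results: List[List[str]] = []

    def extend(pos: int, path: List[str]) -> None:
        if pos == total:  # the whole sentence is covered
            results.append(path)
            return
        for start, end, tok in pool:
            if start == pos:  # segment continues exactly where we stopped
                extend(end, path + [tok])

    extend(0, [])
    return results


first_texts = ["deeplearning", "textbook"]          # coarser granularity
second_texts = ["deep", "learning", "text", "book"]  # finer granularity
candidates = combine(first_texts, second_texts)
```

For this toy input, `candidates` contains four combined texts: the two single-granularity combinations and two mixed ones such as `["deep", "learning", "textbook"]`.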
In some embodiments, the search result includes a plurality of sub-search results respectively corresponding to a plurality of correlation degrees, the search result has a search quantity, and each sub-search result has a sub-search quantity; the scoring module 430 is configured for:
acquiring a weight value corresponding to each correlation degree;
according to the weight value corresponding to each correlation degree and the corresponding sub-search quantity, determining a target value corresponding to each correlation degree;
and determining a relevance score between the first search text and the corresponding search result according to the target value and the search quantity corresponding to each relevance degree.
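A worked sketch of this scoring follows. The exact formula is an assumption: the target value of each correlation degree is taken as its weight multiplied by its sub-search quantity, and the relevance score as the sum of target values divided by the search quantity, with the search quantity taken to be the sum of the sub-search quantities. The degree names and weight values are illustrative only.

```python
from typing import Dict


def relevance_score(sub_counts: Dict[str, int], weights: Dict[str, float]) -> float:
    """Weighted share of results per correlation degree (assumed formula)."""
    search_quantity = sum(sub_counts.values())
    if search_quantity == 0:
        return 0.0  # no results: treat the search text as irrelevant
    target_values = {
        degree: weights[degree] * count for degree, count in sub_counts.items()
    }
    return sum(target_values.values()) / search_quantity


# Hypothetical weights and counts for three correlation degrees.
weights = {"high": 1.0, "medium": 0.5, "low": 0.0}
sub_counts = {"high": 6, "medium": 3, "low": 1}
score = relevance_score(sub_counts, weights)  # (6*1.0 + 3*0.5 + 1*0.0) / 10
```

Dividing by the search quantity keeps scores comparable between first search texts that return different numbers of results.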
In some embodiments, the obtaining module 440 is configured to:
Extracting sentence characteristics corresponding to sample sentences;
extracting word segmentation characteristics corresponding to a sample word segmentation text in the first search text;
extracting statistical characteristics of a sample word segmentation text relative to a sample sentence in a first search text;
and determining the sentence characteristics, the word segmentation characteristics and the statistical characteristics as input characteristics corresponding to the first search text.
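The three feature groups can be illustrated with a small sketch. The concrete features chosen here (sentence length, segment count, mean segment length, and the share of the sentence covered by the segments) are assumptions made for illustration; the application does not fix them in this passage, and `input_features` is a hypothetical name.

```python
from typing import List


def input_features(sentence: str, search_text: List[str]) -> List[float]:
    """Concatenate sentence, word segmentation, and statistical features."""
    seg_lengths = [len(seg) for seg in search_text]
    # Sentence feature: a property of the sample sentence itself.
    sentence_feats = [float(len(sentence))]
    # Word segmentation features: properties of the segments in the search text.
    segmentation_feats = [
        float(len(search_text)),
        sum(seg_lengths) / max(len(search_text), 1),
    ]
    # Statistical feature: the segments measured relative to the sentence.
    statistical_feats = [sum(seg_lengths) / max(len(sentence), 1)]
    return sentence_feats + segmentation_feats + statistical_feats


features = input_features("deeplearningtextbook", ["deep", "learning", "textbook"])
```

The resulting vector is what would be paired with the relevance score label during training.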
In some embodiments, training module 450 is configured to:
inputting the input features corresponding to the first search text into a basic text screening model to obtain an output value;
determining a loss value between the output value and the correlation score according to a preset loss function;
and if the loss value meets the preset loss condition, ending the training of the basic text screening model to obtain the text screening model.
For the specific implementation of each module, refer to the previous embodiments; details are not repeated here.
In the embodiment of the present application, a sample sentence is obtained, and word segmentation processing is performed on the sample sentence according to at least two granularity levels to obtain sample word segmentation texts respectively corresponding to the at least two granularity levels; the sample word segmentation texts respectively corresponding to the at least two granularity levels are combined to obtain a plurality of first search texts; the database is searched according to each first search text to obtain a search result corresponding to each first search text, and a relevance score between each first search text and the corresponding search result is determined; input features corresponding to each first search text are acquired; and the basic text screening model is trained according to the input features and the relevance score corresponding to each first search text to obtain the text screening model. In this training, the input features serve as samples and the relevance scores serve as labels, so the trained text screening model can identify the search text that is most suitable for searching, and the database is then searched using that search text to obtain the search results. Compared with the search results obtained by searching only by utilizing keywords in the related art, the search results obtained by searching with the target search text screened out by the text screening model are more accurate.
Referring to fig. 11, fig. 11 is a schematic structural diagram of a text search device according to an embodiment of the application. The terms have the same meanings as in the text search method described above, and for specific implementation details, refer to the description in the method embodiments.
As shown in fig. 11, the text search device 500 includes: a second word segmentation module 510, a second combination module 520, a screening module 530, and a search module 540.
The second word segmentation module 510 is configured to obtain a search sentence, and perform word segmentation processing on the search sentence according to at least two granularity levels, so as to obtain word segmentation texts corresponding to at least two granularity levels respectively;
the second combination module 520 is configured to combine the word segmentation texts corresponding to at least two granularity levels respectively to obtain a plurality of second search texts;
a screening module 530, configured to input each second search text into a text screening model trained according to the text screening model training method of the present application, and screen out a target search text;
and the searching module 540 is used for searching the database according to the target searching text to obtain target searching results.
In some embodiments, the screening module 530 is configured to:
inputting each second search text into a text screening model trained by the text screening model training method, and outputting an output value corresponding to each second search text;
And determining the second search text with the highest output value from the plurality of second search texts as a target search text.
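The selection performed by the screening module can be sketched as an arg-max over the model's output values. The scoring function below is a toy stand-in for the trained text screening model (it simply favors search texts with fewer segments), and `select_target` is a hypothetical name.

```python
from typing import Callable, List, Sequence


def select_target(
    second_search_texts: Sequence[List[str]],
    score: Callable[[List[str]], float],
) -> List[str]:
    """Return the second search text with the highest output value."""
    return max(second_search_texts, key=score)


def toy_score(search_text: List[str]) -> float:
    """Stand-in for the trained model: fewer segments give a higher score."""
    return 1.0 / len(search_text)


second_search_texts = [
    ["deep", "learning", "text", "book"],
    ["deep", "learning", "textbook"],
    ["deeplearning", "textbook"],
]
target = select_target(second_search_texts, toy_score)
```

With the toy scorer, `target` is `["deeplearning", "textbook"]`; with a trained model, `score` would instead compute the model's output value from the input features of each second search text.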
For the specific implementation of each module, refer to the previous embodiments; details are not repeated here.
From the above, in the embodiment of the present application, a search sentence is obtained and word segmentation processing is performed on it according to at least two granularity levels to obtain word segmentation texts respectively corresponding to the at least two granularity levels; the word segmentation texts respectively corresponding to the at least two granularity levels are combined to obtain a plurality of second search texts; each second search text is input into a text screening model trained by the text screening model training method provided by the embodiment of the present application, and a target search text is screened out; and the database is searched according to the target search text to obtain a target search result. Because the target search text is selected by the text screening model from candidates built at several granularity levels, the search results obtained with it are more accurate than those obtained by searching only with keywords as in the related art.
The embodiment of the present application also provides a computer device, which may be a terminal or a server. The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. The terminal may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in the present application.
Referring to fig. 12, fig. 12 is a schematic structural diagram of a computer device according to an embodiment of the application.
As shown in fig. 12, the computer device may be a server, and fig. 12 shows a schematic structural diagram of the server. Specifically:
the computer device may include a processor 601 with one or more processing cores, a memory 602 comprising one or more computer-readable storage media, a power supply 603, an input unit 604, and other components. Those skilled in the art will appreciate that the computer device structure shown in fig. 12 does not limit the computer device, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components. Wherein:
Processor 601 is the control center of the computer device and connects the various parts of the overall computer device using various interfaces and lines to perform various functions and process data of the computer device by running or executing software programs and/or modules stored in memory 602 and invoking data stored in memory 602. Optionally, the processor 601 may include one or more processing cores; alternatively, the processor 601 may integrate an application processor that primarily handles operating systems, object interfaces, applications, etc., and a modem processor that primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 601.
The memory 602 may be used to store software programs and modules, and the processor 601 may execute various functional applications and data processing by executing the software programs and modules stored in the memory 602. The memory 602 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program (such as a sound playing function, an image playing function, etc.) required for at least one function, and the like; the storage data area may store data created according to the use of the server, etc. In addition, the memory 602 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device. Accordingly, the memory 602 may also include a memory controller to provide access to the memory 602 by the processor 601.
The computer device further includes a power supply 603 for powering the various components, and optionally, the power supply 603 may be logically connected to the processor 601 by a power management system, so as to implement functions of managing charging, discharging, and power consumption management through the power management system. The power supply 603 may also include one or more of any components, such as a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
The computer device may further comprise an input unit 604, which input unit 604 may be used for receiving input digital or character information and for generating keyboard, mouse, joystick, optical or trackball signal inputs in connection with object settings and function control.
Although not shown, the computer device may further include a display unit or the like, which is not described herein. Specifically, in this embodiment, the processor 601 in the computer device loads executable files corresponding to the processes of one or more application programs into the memory 602 according to the following instructions, and the processor 601 runs the application programs stored in the memory 602, so as to implement the steps in the text screening model training method provided in the foregoing embodiment, as follows:
obtaining a sample sentence, and performing word segmentation processing on the sample sentence according to at least two granularity levels to obtain sample word segmentation texts respectively corresponding to the at least two granularity levels;
combining the sample word segmentation texts corresponding to at least two granularity levels respectively to obtain a plurality of first search texts;
searching the database according to each first search text to obtain a search result corresponding to each first search text, and determining a relevance score between each first search text and the corresponding search result;
acquiring input features corresponding to each first search text;
and training the basic text screening model according to the corresponding input characteristics and the corresponding relevance scores of each first search text to obtain a text screening model.
In the foregoing embodiments, the description of each embodiment has its own emphasis; for portions of an embodiment that are not described in detail, refer to the foregoing detailed description of the text screening model training method, which is not repeated herein.
In the embodiment of the present application, a sample sentence is obtained, and word segmentation processing is performed on the sample sentence according to at least two granularity levels to obtain sample word segmentation texts respectively corresponding to the at least two granularity levels; the sample word segmentation texts respectively corresponding to the at least two granularity levels are combined to obtain a plurality of first search texts; the database is searched according to each first search text to obtain a search result corresponding to each first search text, and a relevance score between each first search text and the corresponding search result is determined; input features corresponding to each first search text are acquired; and the basic text screening model is trained according to the input features and the relevance score corresponding to each first search text to obtain the text screening model. In this training, the input features serve as samples and the relevance scores serve as labels, so the trained text screening model can identify the search text that is most suitable for searching, and the database is then searched using that search text to obtain the search results. Compared with the search results obtained by searching only by utilizing keywords in the related art, the search results obtained by searching with the target search text screened out by the text screening model are more accurate.
In some embodiments, the processor 601 in the computer device loads executable files corresponding to the processes of one or more application programs into the memory 602 according to the following instructions, and the processor 601 executes the application programs stored in the memory 602, so as to implement the steps in the text searching method provided in the foregoing embodiments, as follows:
obtaining a search sentence, and performing word segmentation processing on the search sentence according to at least two granularity levels to obtain word segmentation texts respectively corresponding to the at least two granularity levels;
combining word segmentation texts corresponding to at least two granularity levels respectively to obtain a plurality of second search texts;
inputting each second search text into a text screening model trained by the text screening model training method provided by the embodiment of the application, and screening out target search text;
searching the database according to the target search text to obtain a target search result.
In the foregoing embodiments, the description of each embodiment has its own emphasis; for portions of an embodiment that are not described in detail, refer to the foregoing detailed description of the text search method, which is not repeated herein.
From the above, in the embodiment of the present application, a search sentence is obtained and word segmentation processing is performed on it according to at least two granularity levels to obtain word segmentation texts respectively corresponding to the at least two granularity levels; the word segmentation texts respectively corresponding to the at least two granularity levels are combined to obtain a plurality of second search texts; each second search text is input into a text screening model trained by the text screening model training method provided by the embodiment of the present application, and a target search text is screened out; and the database is searched according to the target search text to obtain a target search result. Because the target search text is selected by the text screening model from candidates built at several granularity levels, the search results obtained with it are more accurate than those obtained by searching only with keywords as in the related art.
Those of ordinary skill in the art will appreciate that all or a portion of the steps of the various methods of the above embodiments may be performed by instructions, or by instructions controlling associated hardware, which may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, embodiments of the present application provide a computer readable storage medium having stored therein a plurality of instructions capable of being loaded by a processor to perform the steps of any of the text screening model training methods provided by embodiments of the present application. For example, the instructions may perform the steps of:
obtaining a sample sentence, and performing word segmentation processing on the sample sentence according to at least two granularity levels to obtain sample word segmentation texts respectively corresponding to the at least two granularity levels;
combining the sample word segmentation texts corresponding to at least two granularity levels respectively to obtain a plurality of first search texts;
searching the database according to each first search text to obtain a search result corresponding to each first search text, and determining a relevance score between each first search text and the corresponding search result;
Acquiring input features corresponding to each first search text;
and training the basic text screening model according to the corresponding input characteristics and the corresponding relevance scores of each first search text to obtain a text screening model.
The computer readable storage medium has stored therein a plurality of instructions capable of being loaded by a processor to perform the steps of any of the text search methods provided by the embodiments of the present application. For example, the instructions may perform the steps of:
obtaining a search sentence, and performing word segmentation processing on the search sentence according to at least two granularity levels to obtain word segmentation texts respectively corresponding to the at least two granularity levels;
combining word segmentation texts corresponding to at least two granularity levels respectively to obtain a plurality of second search texts;
inputting each second search text into a text screening model trained by the text screening model training method provided by the embodiment of the application, and screening out target search text;
searching the database according to the target search text to obtain a target search result.
According to one aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the methods provided in the various alternative implementations provided in the above embodiments.
For the specific implementation of each operation above, refer to the previous embodiments; details are not repeated here.
The computer-readable storage medium may include a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.
Because the instructions stored in the computer-readable storage medium can execute the steps of any text screening model training method or text search method provided by the embodiments of the present application, they can achieve the beneficial effects of any such method; for details, see the previous embodiments, which are not repeated here.
The text screening model training method, related method, device, medium and equipment provided by the embodiments of the present application have been described in detail above. Specific examples are used herein to explain the principle and implementation of the present application, and the description of the above embodiments is only intended to help understand the method and core idea of the present application. Meanwhile, those skilled in the art may, in light of the ideas of the present application, make changes to the specific implementations and the application scope. In summary, the content of this description should not be construed as limiting the present application.

Claims (15)

1. A text screening model training method, comprising:
obtaining a sample sentence, and performing word segmentation processing on the sample sentence according to at least two granularity levels to obtain sample word segmentation texts respectively corresponding to the at least two granularity levels;
combining the sample word segmentation texts respectively corresponding to the at least two granularity levels to obtain a plurality of first search texts;
searching a database according to each first search text to obtain a search result corresponding to each first search text, and determining a relevance score between each first search text and the corresponding search result;
acquiring input features corresponding to each first search text;
and training a basic text screening model according to the input features corresponding to each first search text and the corresponding relevance scores to obtain a text screening model.
2. The text screening model training method according to claim 1, wherein the combining of the sample word segmentation texts respectively corresponding to the at least two granularity levels to obtain a plurality of first search texts comprises:
determining the text obtained by combining the sample word segmentation texts corresponding to different granularity levels as a first search text;
and determining the text obtained by combining the sample word segmentation texts corresponding to each granularity level as a first search text.
3. The text screening model training method according to claim 2, wherein the granularity levels include a first granularity level and a second granularity level, the sample word segmentation texts include a first word segmentation text and a second word segmentation text, and the performing of word segmentation processing on the sample sentence according to at least two granularity levels to obtain sample word segmentation texts respectively corresponding to the at least two granularity levels comprises:
performing word segmentation processing on the sample sentence according to the first granularity level to obtain a first word segmentation text corresponding to the first granularity level;
and performing word segmentation processing on the sample sentence according to the second granularity level to obtain a second word segmentation text corresponding to the second granularity level.
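By way of illustration only, and not as part of the claims, the two-level word segmentation of claim 3 might be sketched as follows. The sketch assumes a greedy forward maximum matching segmenter and two toy vocabularies; the claims do not name a segmentation algorithm, so the function `segment`, the vocabularies `COARSE_VOCAB` and `FINE_VOCAB`, and the example sentence are all hypothetical.

```python
def segment(sentence: str, vocab: set[str], max_len: int) -> list[str]:
    """Greedy forward maximum matching (an assumed, illustrative segmenter)."""
    tokens: list[str] = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first; a single character always matches.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical vocabularies standing in for the two granularity levels.
COARSE_VOCAB = {"南京市", "长江大桥"}
FINE_VOCAB = {"南京", "长江", "大桥"}

sample_sentence = "南京市长江大桥"
first_texts = segment(sample_sentence, COARSE_VOCAB, max_len=4)   # first granularity level
second_texts = segment(sample_sentence, FINE_VOCAB, max_len=2)    # second granularity level
```

In this toy setting the first level yields two long tokens and the second level yields four shorter ones, which is the kind of input the combining step of claim 2 operates on.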
4. The text screening model training method according to claim 3, wherein before combining the sample word segmentation texts respectively corresponding to the at least two granularity levels to obtain a plurality of first search texts, the method further comprises:
determining a semantic order corresponding to the sample sentence;
the determining of the text obtained by combining the sample word segmentation texts corresponding to different granularity levels as a first search text comprises:
determining the text obtained by combining the first word segmentation text and the second word segmentation text according to the semantic order as a first search text;
the determining of the text obtained by combining the sample word segmentation texts corresponding to each granularity level as the first search text comprises:
and determining the text obtained by combining the first word segmentation texts according to the semantic order and the text obtained by combining the second word segmentation texts according to the semantic order as the first search text.
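As a non-limiting sketch, one possible reading of combining "according to the semantic order" is to enumerate every token sequence that covers the sample sentence from left to right, taking each token from either granularity level. This interpretation, the helper names `token_spans` and `combine_in_order`, and the space-joined output format are assumptions made for illustration; the claims do not prescribe them.

```python
def token_spans(tokens: list[str]) -> list[tuple[int, int]]:
    """Return the (start, end) character span of each token in the sentence."""
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def combine_in_order(coarse: list[str], fine: list[str]) -> list[str]:
    """Enumerate order-preserving combinations of tokens from both levels."""
    sentence = "".join(coarse)
    if "".join(fine) != sentence:
        raise ValueError("both segmentations must cover the same sentence")

    # Map each start offset to the end offsets of tokens beginning there.
    ends_by_start: dict[int, set[int]] = {}
    for start, end in token_spans(coarse) + token_spans(fine):
        ends_by_start.setdefault(start, set()).add(end)

    results: list[str] = []

    def walk(pos: int, path: list[str]) -> None:
        if pos == len(sentence):
            results.append(" ".join(path))
            return
        for end in sorted(ends_by_start.get(pos, ())):
            walk(end, path + [sentence[pos:end]])

    walk(0, [])
    return results


first_search_texts = combine_in_order(
    ["南京市", "长江大桥"], ["南京", "市", "长江", "大桥"]
)
```

Under this reading the result contains the two single-level combinations (all first-level tokens, all second-level tokens) as well as the mixed-level ones, so both branches of claim 2 are covered by one enumeration.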
5. The text screening model training method according to claim 3, wherein before combining the sample word segmentation texts respectively corresponding to the at least two granularity levels to obtain a plurality of first search texts, the method further comprises:
numbering each first word segmentation text to obtain a first numbering result;
numbering each second word segmentation text to obtain a second numbering result;
the determining of the text obtained by combining the sample word segmentation texts corresponding to different granularity levels as a first search text comprises:
combining the first word segmentation texts and the second word segmentation texts according to the first numbering result and the second numbering result to obtain a plurality of combined texts;
determining an unambiguous combined text among the plurality of combined texts as a first search text;
the determining of the text obtained by combining the sample word segmentation texts corresponding to each granularity level as the first search text comprises:
combining the first word segmentation texts according to the first numbering result and combining the second word segmentation texts according to the second numbering result to obtain a plurality of combined texts;
determining an unambiguous combined text among the plurality of combined texts as the first search text.
6. The text screening model training method according to claim 1, wherein the search result comprises a plurality of sub-search results corresponding to respective relevance degrees, the search result comprises a search quantity, and the sub-search results comprise sub-search quantities;
the determining of a relevance score between the first search text and the corresponding search result comprises:
acquiring a weight value corresponding to each relevance degree;
determining a target value corresponding to each relevance degree according to the weight value corresponding to each relevance degree and the corresponding sub-search quantity;
and determining a relevance score between the first search text and the corresponding search result according to the target value corresponding to each relevance degree and the search quantity.
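For illustration only, the scoring of claim 6 might be computed as a weighted average. The sketch assumes that the target value for a relevance degree is the product of its weight value and its sub-search quantity, and that the relevance score is the sum of the target values divided by the search quantity; the claim states only which quantities are used, so this formula, the function name `relevance_score`, and the degree labels and weights are hypothetical.

```python
def relevance_score(sub_counts: dict[str, int], weights: dict[str, float]) -> float:
    """Weighted share of results per relevance degree (assumed formula)."""
    search_quantity = sum(sub_counts.values())
    if search_quantity == 0:
        return 0.0  # no results were returned for this search text
    # Target value per relevance degree: weight value times sub-search quantity.
    target_values = {
        degree: weights[degree] * count for degree, count in sub_counts.items()
    }
    return sum(target_values.values()) / search_quantity


# Hypothetical relevance degrees and weight values.
weights = {"high": 1.0, "medium": 0.5, "low": 0.0}
sub_counts = {"high": 6, "medium": 3, "low": 1}
score = relevance_score(sub_counts, weights)  # (6*1.0 + 3*0.5 + 1*0.0) / 10
```

With these toy numbers the score is 0.75; a search text whose results are mostly of a high relevance degree scores closer to 1 under this assumed weighting.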
7. The text screening model training method according to claim 1, wherein the acquiring of the input features corresponding to the first search text comprises:
extracting sentence features corresponding to the sample sentence;
extracting word segmentation features corresponding to the sample word segmentation texts in the first search text;
extracting statistical features of the sample word segmentation texts in the first search text relative to the sample sentence;
and determining the sentence features, the word segmentation features and the statistical features as the input features corresponding to the first search text.
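As a hedged sketch of claim 7, the three feature groups could be concatenated into one numeric vector. The concrete features chosen here (sentence length, token count, mean token length, character coverage, share of the longest token) are assumptions for illustration; the claim does not enumerate specific features, and the function name `input_features` is hypothetical.

```python
def input_features(sentence: str, search_tokens: list[str]) -> list[float]:
    """Concatenate sentence, word segmentation, and statistical features."""
    token_lens = [len(tok) for tok in search_tokens]

    # Sentence features: here only the sentence length in characters.
    sentence_feats = [float(len(sentence))]

    # Word segmentation features: token count and mean token length.
    mean_len = sum(token_lens) / len(token_lens) if token_lens else 0.0
    segmentation_feats = [float(len(search_tokens)), mean_len]

    # Statistical features of the tokens relative to the sentence:
    # fraction of characters covered and share of the longest token.
    if sentence:
        coverage = sum(token_lens) / len(sentence)
        longest_share = max(token_lens, default=0) / len(sentence)
    else:
        coverage = longest_share = 0.0
    statistical_feats = [coverage, longest_share]

    return sentence_feats + segmentation_feats + statistical_feats


features = input_features("南京市长江大桥", ["南京市", "长江大桥"])
```

A production system would likely use learned embeddings for the first two groups; fixed-length numeric features are used here only so that the sketch stays self-contained.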
8. The text screening model training method according to claim 1, wherein the training of the basic text screening model according to the input features corresponding to the first search text and the relevance score to obtain the text screening model comprises:
inputting the input features corresponding to the first search text into the basic text screening model to obtain an output value;
determining a loss value between the output value and the relevance score according to a preset loss function;
and if the loss value meets a preset loss condition, taking the trained basic text screening model as the text screening model.
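Purely as an illustration of the loop described in claim 8, the sketch below trains a linear model by gradient descent. It assumes mean squared error as the preset loss function and "loss at or below a threshold" as the preset loss condition; the model form, the function name `train`, and all hyperparameters are hypothetical, since the claim does not fix any of them.

```python
def train(
    features: list[list[float]],
    scores: list[float],
    lr: float = 0.1,
    loss_threshold: float = 1e-4,
    max_epochs: int = 10_000,
) -> tuple[list[float], float, float]:
    """Fit weights and bias so that outputs approximate relevance scores."""
    n = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    loss = float("inf")
    for _ in range(max_epochs):
        # Forward pass: one output value per first search text.
        outputs = [sum(w * x for w, x in zip(weights, row)) + bias for row in features]
        errors = [out - y for out, y in zip(outputs, scores)]
        # Assumed preset loss function: mean squared error.
        loss = sum(e * e for e in errors) / n
        # Assumed preset loss condition: stop once the loss is small enough.
        if loss <= loss_threshold:
            break
        # Gradient descent update of the weights and the bias.
        for j in range(len(weights)):
            grad = 2 * sum(e * row[j] for e, row in zip(errors, features)) / n
            weights[j] -= lr * grad
        bias -= lr * 2 * sum(errors) / n
    return weights, bias, loss


# Toy data: one input feature per search text and its relevance score.
toy_features = [[0.0], [1.0], [2.0]]
toy_scores = [0.2, 0.5, 0.8]
model_weights, model_bias, final_loss = train(toy_features, toy_scores)
```

The returned loss is the value measured before the last parameter update, so it is the quantity that was actually compared against the threshold.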
9. A text search method, comprising:
obtaining a search sentence, and performing word segmentation processing on the search sentence according to at least two granularity levels to obtain word segmentation texts corresponding to at least two granularity levels respectively;
combining word segmentation texts corresponding to at least two granularity levels respectively to obtain a plurality of second search texts;
inputting each second search text into a text screening model trained by the text screening model training method according to any one of claims 1-8, and screening out target search text;
and searching the database according to the target search text to obtain a target search result.
10. The text search method according to claim 9, wherein the inputting of each second search text into a text screening model trained by the text screening model training method according to any one of claims 1 to 8 and screening out a target search text comprises:
inputting each second search text into the text screening model trained by the text screening model training method according to any one of claims 1 to 8, and outputting an output value corresponding to each second search text;
and determining the second search text with the highest output value from the plurality of second search texts as a target search text.
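The selection step of claims 9 and 10 can be sketched, under stated assumptions, as taking the candidate with the highest model output and then querying the database with it. The scorer below is a lookup table standing in for a trained text screening model, and `search_database` is a toy substring match over a list of strings; both, along with the function names, are hypothetical and not taken from the application.

```python
from typing import Callable, Sequence


def select_target(candidates: Sequence[str], model: Callable[[str], float]) -> str:
    """Return the second search text with the highest output value."""
    if not candidates:
        raise ValueError("no candidate search texts")
    return max(candidates, key=model)


def search_database(target: str, documents: Sequence[str]) -> list[str]:
    """Toy search: keep documents that contain every token of the target."""
    tokens = target.split()
    return [doc for doc in documents if all(tok in doc for tok in tokens)]


# Stand-in for a trained text screening model (hypothetical output values).
toy_outputs = {"南京市 长江大桥": 0.9, "南京 市 长江 大桥": 0.4}
second_search_texts = list(toy_outputs)

target_text = select_target(second_search_texts, lambda text: toy_outputs[text])
documents = ["南京市长江大桥简介", "长江大桥通车纪念", "南京旅游指南"]
target_results = search_database(target_text, documents)
```

Only the selected target search text is sent to the database, which is the point of screening: the remaining candidate combinations are never searched at query time.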
11. A text screening model training device, comprising:
a first word segmentation module, configured to obtain a sample sentence and perform word segmentation processing on the sample sentence according to at least two granularity levels to obtain sample word segmentation texts respectively corresponding to the at least two granularity levels;
a first combination module, configured to combine the sample word segmentation texts respectively corresponding to the at least two granularity levels to obtain a plurality of first search texts;
a scoring module, configured to search a database according to each first search text to obtain a search result corresponding to each first search text, and determine a relevance score between each first search text and the corresponding search result;
an acquisition module, configured to acquire input features corresponding to each first search text;
and a training module, configured to train a basic text screening model according to the input features corresponding to each first search text and the corresponding relevance scores to obtain a text screening model.
12. A text search device, comprising:
a second word segmentation module, configured to acquire a search sentence and perform word segmentation processing on the search sentence according to at least two granularity levels to obtain word segmentation texts respectively corresponding to the at least two granularity levels;
a second combination module, configured to combine the word segmentation texts respectively corresponding to the at least two granularity levels to obtain a plurality of second search texts;
a screening module, configured to input each of the second search texts into a text screening model trained according to the text screening model training method of any one of claims 1 to 8, and screen out a target search text;
and a search module, configured to search the database according to the target search text to obtain a target search result.
13. A computer readable storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps of the text screening model training method of any of claims 1 to 8 or the steps of the text search method of claim 9 or 10.
14. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the text screening model training method of any of claims 1 to 8 or the steps of the text search method of claim 9 or 10 when the computer program is executed.
15. A computer program product comprising a computer program or instructions which, when executed by a processor, carries out the steps of the text screening model training method of any one of claims 1 to 8 or the steps of the text search method of claim 9 or 10.
CN202311255991.3A 2023-09-27 2023-09-27 Text screening model training method, related method, device, medium and equipment Active CN116991980B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311255991.3A CN116991980B (en) 2023-09-27 2023-09-27 Text screening model training method, related method, device, medium and equipment

Publications (2)

Publication Number Publication Date
CN116991980A (en) 2023-11-03
CN116991980B (en) 2024-01-19

Family

ID=88525200

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311255991.3A Active CN116991980B (en) 2023-09-27 2023-09-27 Text screening model training method, related method, device, medium and equipment

Country Status (1)

Country Link
CN (1) CN116991980B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102479191A (en) * 2010-11-22 2012-05-30 阿里巴巴集团控股有限公司 Method and device for providing multi-granularity word segmentation result
CN111078858A (en) * 2018-10-19 2020-04-28 阿里巴巴集团控股有限公司 Article searching method and device and electronic equipment
US20200380038A1 (en) * 2019-05-27 2020-12-03 Microsoft Technology Licensing, Llc Neural network for search retrieval and ranking
CN113343132A (en) * 2021-06-30 2021-09-03 北京三快在线科技有限公司 Model training method, information display method and device
CN115292603A (en) * 2022-08-17 2022-11-04 广州华多网络科技有限公司 Commodity searching method, apparatus, device and medium
CN115545832A (en) * 2022-10-08 2022-12-30 广州欢聚时代信息科技有限公司 Commodity search recommendation method and device, equipment and medium thereof
CN116701570A (en) * 2023-05-12 2023-09-05 北京字跳网络技术有限公司 Recall result screening method, recall result screening device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN116991980B (en) 2024-01-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant