CN109033478B - Text information rule analysis method and system for search engine - Google Patents


Publication number
CN109033478B
Authority
CN
China
Prior art keywords
text
original document
phrase
sample
target sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811062638.2A
Other languages
Chinese (zh)
Other versions
CN109033478A (en
Inventor
郑燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Industry Polytechnic College
Original Assignee
Chongqing Industry Polytechnic College
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Industry Polytechnic College filed Critical Chongqing Industry Polytechnic College
Priority to CN201811062638.2A priority Critical patent/CN109033478B/en
Publication of CN109033478A publication Critical patent/CN109033478A/en
Application granted granted Critical
Publication of CN109033478B publication Critical patent/CN109033478B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a text information rule analysis method for a search engine, which comprises the following steps: acquiring a text of a natural language original document; extracting the features of the text of the natural language original document to generate a text feature vector; matching the text with samples in a sample library by using a pre-trained vector matching model according to the text characteristic vector to obtain a target sample; determining a semantic distribution rule mode of a text by utilizing a pre-trained semantic distribution rule mode determination model according to the text characteristic consistency between the original sample document of the target sample and the corresponding target sample index set; and converting the text of the natural language original document into an index set according to the semantic distribution rule mode of the text. The application also provides a text information rule analysis system. The method and the device have the advantages that the word distribution rule of the core semantic bearing words of the original documents of the natural language is explored, and the extraction of the search engine index items with high accuracy is realized.

Description

Text information rule analysis method and system for search engine
Technical Field
The application relates to the technical field of internet application, in particular to a text information rule analysis method and system for a search engine.
Background
Search engines are essential tools for obtaining needed knowledge and information from the mass of data on the internet. Search engines first arose from the need to search text information, and text information search is still one of their main functions.
In text information search, the indexer of a search engine extracts index items from original documents on the Internet. The index items are generally several words that appear in the original document, and the index items and the links to the corresponding original documents are stored in an index table. The retriever then queries the index database for index items that match the user's query keywords and quickly locates the original documents. The retriever also evaluates the relevance between each original document and the query keywords, sorts the results to be output, and displays to the user the search results, including links pointing to the original documents. In this search process, extracting the index items from the original document is a relatively complex step. In an original document written in natural language, the words bearing the core semantics are submerged in a large number of other words; they are not necessarily the words with the highest word frequency (i.e. the number or proportion of occurrences of a word in the document); the grammar of natural language lacks clearly definable rules or marks that help identify them; and they are not always distributed at fixed positions in the original document. In other words, the rules that link the core semantics of natural language to the text information of the original document are hidden and variable.
In the prior art, a search engine mainly utilizes a word frequency statistical rule and combines a weight distribution rule based on an article structure to extract index items in natural language documents, so that wrong extraction results often occur, namely the extracted index items do not reflect the core semantics of the original documents, and words bearing the core semantics are missed, and the errors are easy to occur particularly in short documents without additional semantic prompt tags.
In natural reading, human beings rely on the comprehension ability accumulated through life experience and language learning to find the words that carry the core semantics of a document, but there are still great obstacles to reproducing human reading comprehension with a computer.
Artificial intelligence (AI) is a branch of computer science that attempts to understand the essence of intelligence and to produce intelligent machines that can react in a manner similar to human intelligence; research in this field includes robotics, speech recognition, image recognition, natural language processing, and expert systems. Since its birth, the theory and technology of artificial intelligence have matured steadily and its fields of application have kept expanding. In the field of text learning, artificial intelligence technology has been applied to natural language semantic recognition, machine translation, and other tasks. Given the potential of artificial intelligence to simulate human intelligent activity, developers of search engines generally hope to apply the technology to the analysis of text information rules, so as to help extract index items carrying core semantics from natural language original documents, especially short documents without auxiliary information such as tags.
Disclosure of Invention
In view of this, the present application aims to provide a method and a system for analyzing a text information rule for a search engine, so as to solve the technical problem in the prior art that extraction of a search engine index item is difficult and erroneous due to an unobvious and uncertain existence rule of words bearing document core semantics.
In one aspect of the present application, a text information rule analysis method for a search engine is provided, including:
acquiring a text of a natural language original document;
extracting the features of the text of the natural language original document to generate a text feature vector;
matching the text with samples in a sample library by using a pre-trained vector matching model according to the text feature vector to obtain a target sample, wherein the sample comprises a sample index set and sample original documents corresponding to the sample index set;
determining a semantic distribution rule mode of a text by utilizing a pre-trained semantic distribution rule mode determination model according to the text characteristic consistency between the original sample document of the target sample and the corresponding target sample index set;
and converting the text of the natural language original document into an index set according to the semantic distribution rule mode of the text.
In some embodiments, the extracting the features of the text of the natural language original document to generate a text feature vector includes:
extracting phrases in the text, classifying the attributes of the phrases, counting the word frequency of each category of phrases, and generating text characteristic vectors according to the categories of the phrases and the word frequency of each category of phrases.
In some embodiments, the extracting a phrase in the text, performing attribute classification on the phrase, and performing statistics on word frequency of each category of phrases includes:
and segmenting the text into a plurality of word groups, classifying each word group, determining the attribute category of each word group, and performing word frequency statistics on the word groups of each attribute category.
In some embodiments, classifying each phrase and determining the attribute category of each phrase specifically includes:
and constructing a phrase attribute classification table, wherein the phrase attribute classification table comprises phrase attribute categories and phrase semantics corresponding to the categories, performing semantic recognition on each phrase, and determining the phrase attribute categories of the phrases.
In some embodiments, after segmenting the text into a plurality of word groups and performing semantic recognition on each word group, the method further includes:
performing stop-word removal on the plurality of phrases after the semantic recognition, that is, filtering and denoising them to filter out the noise phrases they contain.
In some embodiments, the matching the text with the samples in the sample library according to the text feature vector by using a pre-trained vector matching model includes:
pre-training a neural network model, generating a vector matching model, calculating the standard deviation of the text feature vector of the current natural language original document text and the text feature vector of the sample original document in the sample library by using the vector matching model, matching successfully when the standard deviation is less than a preset threshold value, and taking the successfully matched sample original document as a target sample original document.
In some embodiments, the determining, by using a pre-trained semantic distribution rule pattern determining model, a semantic distribution rule pattern of a text according to text feature consistency between a sample original document of the target sample and a corresponding target sample index set includes:
and calculating text characteristic vectors of the target sample original document and the corresponding target sample index set, and determining a semantic distribution rule mode of the text according to the consistency of the phrase frequencies of similar phrases in the text characteristic vectors of the target sample original document and the corresponding target sample index set.
In another aspect of the present application, a text information rule analysis system for a search engine is provided, including:
the text acquisition module is used for acquiring the text of the natural language original document;
the text feature vector generation module is used for extracting features of the text of the natural language original document to generate a text feature vector;
the vector matching module is used for matching the text of the natural language original document with samples in a sample library according to the text characteristic vector to obtain a target sample;
the semantic distribution rule mode determining module is used for determining a semantic distribution rule mode of the text according to the text characteristic consistency between the target sample original document and the corresponding target sample index set;
and the index set generation module is used for converting the text of the natural language original document into an index set according to the semantic distribution rule mode of the text.
In some embodiments, the text feature vector generation module is specifically configured to:
extracting phrases in the text, classifying the phrases according to the attributes, counting the word frequency of each attribute type phrase, and generating a text characteristic vector according to the phrase attribute type and the word frequency of each type phrase.
In some embodiments, the semantic distribution rule pattern determining module is specifically configured to:
and calculating text characteristic vectors of the target sample original document and the corresponding target sample index set, and determining a semantic distribution rule mode according to the consistency of the phrase frequencies of similar phrases in the text characteristic vectors of the target sample original document and the corresponding target sample index set.
The text information rule analysis method and system for the search engine, provided by the embodiment of the application, are used for extracting the features of the text of the natural language original document to generate a text feature vector; matching the text with samples in a sample library by utilizing a pre-trained vector matching model according to the text feature vector to obtain a target sample, and determining a semantic distribution rule mode of the text according to the text feature consistency between a sample original document of the target sample and a corresponding target sample index set; and converting the text of the natural language original document into an index set according to the semantic distribution rule mode of the text. According to the method, the word distribution rule bearing core semantics is explored for the original document of the natural language through the artificial intelligence learning method, and the extraction of the search engine index item with high accuracy is achieved.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
fig. 1 is a flowchart of a text information rule analysis method for a search engine according to a first embodiment of the present application;
fig. 2 is a flowchart of a text information rule analysis method for a search engine according to a second embodiment of the present application;
fig. 3 is a schematic structural diagram of a text information rule analysis system for a search engine according to a third embodiment of the present application;
fig. 4 is a flowchart illustrating a fourth embodiment of the present application, where the index set is generated by using the system for analyzing a text information rule for a search engine according to the embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 is a flowchart illustrating a method for analyzing a text information rule for a search engine according to a first embodiment of the present application. As can be seen from the figure, the method for analyzing the regularity of the text information for the search engine provided by the embodiment includes the following steps:
S101: text of an original document in natural language is obtained.
In this embodiment, the text of the natural language original document may be manually input or may be automatically acquired by the system. In this embodiment and the following embodiments, the natural language original document refers to a piece of text, for example, "Light color is a numerical value in optics that expresses the color of light in units of K (kelvin); the light colors generally encountered in daily life range from 2700K to 6500K, and illuminants with a light color above 7000K are used for lighting in industry and special fields (such as automobile lighting)", or "A highway indicates the driving speed of each lane; the maximum speed shall not exceed 120 km per hour and the minimum speed shall not be lower than 60 km per hour; the maximum speed of a small passenger car driving on the highway shall not exceed 120 km per hour, other motor vehicles shall not exceed 100 km per hour, and a motorcycle shall not exceed 80 km per hour". The search engine can search for and collect massive amounts of natural language original document text from raw data such as web pages, electronic books, and papers.
S102: and performing feature extraction on the text of the natural language original document to generate a text feature vector.
In this embodiment, after the text of the natural language original document is obtained, feature extraction may be performed on the text to generate a text feature vector. Specifically, the text may be segmented into a plurality of phrases, and phrases without practical meaning may then be removed by stop-word removal, which may be implemented with reference to a common stop word list. Stop-word removal means filtering and denoising the phrases obtained by segmentation so as to filter out the noise phrases among them. Because the text may contain conjunctions and adverbs, and such phrases carry no actual meaning for semantic recognition of the text, they can be filtered out, which greatly reduces the workload of the machine.
Then the retained phrases are classified into preset categories, and the word frequency is counted per category, that is, the number of phrases of each category in the original document; a text feature vector is generated from the phrase categories and the number of phrases in each category. Taking again the example "A highway indicates the driving speed of each lane; the maximum speed shall not exceed 120 km per hour and the minimum speed shall not be lower than 60 km per hour; the maximum speed of a small passenger car driving on the highway shall not exceed 120 km per hour, other motor vehicles shall not exceed 100 km per hour, and a motorcycle shall not exceed 80 km per hour", the phrase categories may include concept phrases, namely "small passenger car", "other motor vehicles" and "motorcycle", and quantity phrases, namely "120 km/h", "100 km/h", "80 km/h" and "60 km/h".
For the above-mentioned phrase classification, a phrase category index table may be established, in which common phrases corresponding to each category are recorded, and phrases that are extracted from the original document text of the natural language and remain after the stop word is removed are classified into a phrase category corresponding to the index table by calling the corresponding phrase category index table.
Furthermore, using the counted phrase categories and the word frequency (number of phrases) of each category, a corresponding text feature vector is generated for the text of the natural language original document, expressed as { (S1, N1), (S2, N2) … (Sn, Nn) }, where S1, S2 … Sn are phrase categories, such as the concept phrases and quantity phrases in the text, and N1, N2 … Nn are the word frequencies of the phrase categories, that is, the number of phrases classified under each category. For example, for the example text above, the extracted text feature vector should be { (concept phrase, 3), (quantity phrase, 4) }, where the numbers 3 and 4 represent word frequencies.
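The feature extraction described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the category table, the stop word list and the function names are hypothetical, the input is assumed to be already segmented into phrases, and each distinct phrase is counted once, an assumption chosen so that the sketch reproduces the counts 3 and 4 of the highway example.

```python
from collections import Counter

# Hypothetical phrase category index table: category -> common phrases.
CATEGORY_TABLE = {
    "concept phrase": {"small passenger car", "other motor vehicles", "motorcycle"},
    "quantity phrase": {"120 km/h", "100 km/h", "80 km/h", "60 km/h"},
}
# Hypothetical stop word list.
STOP_WORDS = {"the", "shall", "not", "exceed", "and"}


def classify(phrase):
    """Return the category of a phrase, or None if the table does not list it."""
    for category, members in CATEGORY_TABLE.items():
        if phrase in members:
            return category
    return None


def text_feature_vector(phrases):
    """Build [(category, count), ...] from already-segmented phrases."""
    kept = {p for p in phrases if p not in STOP_WORDS}  # distinct, stop words removed
    counts = Counter(c for c in map(classify, kept) if c is not None)
    return sorted(counts.items())


phrases = ["the", "small passenger car", "shall", "not", "exceed", "120 km/h",
           "other motor vehicles", "100 km/h", "motorcycle", "80 km/h",
           "60 km/h", "120 km/h"]
vector = text_feature_vector(phrases)
print(vector)
```

A real system would need a word segmenter and a far larger category table; the lookup here only stands in for the semantic recognition step.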
S103: and matching the text with samples in a sample library by utilizing a pre-trained vector matching model according to the text feature vector to obtain a target sample, wherein the sample comprises a sample index set and sample original documents corresponding to the sample index set.
In this embodiment, after the text feature vector of the text of the natural language original document is generated, it may be matched with the samples in the sample library using a vector matching model. The samples in the sample library include a large number of sample index sets and the sample original documents corresponding to them. Specifically, the vector matching model is a neural network model generated by learning a large number of samples in the sample library, so that, given the text of a natural language original document as input, it outputs a sample original document with high similarity to that text. Here similarity refers to the similarity between the text feature vectors of the texts, and includes the similarity between phrase categories and the similarity of the number of phrases of the same category.
As a pre-trained neural network model, the vector matching model receives the text feature vector of the current natural language original document, then calculates and outputs the standard deviation between that vector and the text feature vector of each sample original document in the sample library. When the standard deviation is smaller than a preset threshold, the matching is successful, and the successfully matched sample original document is taken as the target sample original document. Specifically, if the text feature vector of the natural language original document is { (S1, N1), (S2, N2) … (Sn, Nn) } and the text feature vector of the sample original document text is { (S1, N1'), (S2, N2') … (Sn, Nn') }, the standard deviation of the two text feature vectors is expressed as
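[Formula image BDA0001797492090000081: expression for the standard deviation ε of the two text feature vectors; not recoverable from the extracted text]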
If epsilon is less than the threshold, the matching is considered successful and the target sample original document corresponds to the current natural language original document.
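The threshold test can be illustrated with a short sketch. The formula itself is not available in this text, so the code assumes a root-mean-square difference of the per-category counts as the deviation ε; that choice, the function names and the sample identifiers are all assumptions. The description attributes the calculation to a pre-trained neural network, whereas this sketch computes the deviation directly.

```python
import math


def deviation(vec_a, vec_b):
    """Assumed form of epsilon: root-mean-square difference of category counts."""
    a, b = dict(vec_a), dict(vec_b)
    categories = sorted(set(a) | set(b))
    if not categories:
        return 0.0
    squared = [(a.get(c, 0) - b.get(c, 0)) ** 2 for c in categories]
    return math.sqrt(sum(squared) / len(categories))


def match_samples(query_vec, sample_vecs, threshold):
    """Return ids of samples whose deviation from the query is below threshold."""
    return [sid for sid, vec in sample_vecs.items()
            if deviation(query_vec, vec) < threshold]


query = [("concept phrase", 3), ("quantity phrase", 4)]
samples = {  # hypothetical sample library: id -> text feature vector
    "doc_a": [("concept phrase", 4), ("quantity phrase", 3)],
    "doc_b": [("concept phrase", 10), ("quantity phrase", 0)],
}
matches = match_samples(query, samples, threshold=2.0)
print(matches)
```

Categories missing from one vector are treated as having a count of 0, so vectors over different category sets can still be compared.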
S104: using a pre-trained semantic distribution rule mode determination model, the semantic distribution rule mode of the text is determined according to the text feature consistency between the target sample original document and the corresponding target sample index set.
In this embodiment, after determining the target sample original document corresponding to the natural language original document text by using the vector matching model, the phrase category related to the index word in the index set may be determined according to the text feature consistency between the sample original document and the target sample index set corresponding to the sample original document, and further, the phrase category related to the index set of the natural language original document may be determined according to the phrase category of the target sample index set.
Specifically, the semantic distribution rule pattern determining model in this embodiment is a neural network model generated by learning a large number of samples in the sample library. By learning a large number of sample index sets and the sample original documents corresponding to them, the model can determine the consistency between the text feature vectors of an input sample index set and of the corresponding sample original document, and determine from that consistency the phrase category related to the index words in the index set. In particular, the model calculates the text feature vectors of the target sample original document and of the corresponding target sample index set, compares the phrase frequencies of phrases of the same category in the two vectors, and takes the phrase category with a high word frequency in both the target sample original document and the target sample index set as the phrase category related to the index set.
Consider the following example. The sample original document is the text "Light color is a numerical value in optics that expresses the color of light in units of K (kelvin); the light colors generally encountered in daily life range from 2700K to 6500K, and illuminants with a light color above 7000K are used for lighting in industry and special fields (such as automobile lighting)". The phrase categories of the sample original document include concept phrases and number phrases: the extracted phrases "light color", "optics", "lighting" and "illuminant" are concept phrases, and "2700K", "6500K" and "7000K" are number phrases, so the text feature vector is { (concept phrase, 4), (number phrase, 3) }. The corresponding sample index set contains the index words "light color", "illuminant" and "optics", so the text feature vector of the sample index set may be { (concept phrase, 3), (number phrase, 0) }. The two text feature vectors are consistent in that the word frequency is high in the concept phrase dimension, so the phrase category related to the index set is determined to be the concept phrase.
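The consistency check in this example can be sketched as follows. The description does not give the scoring rule, so the sketch assumes that a category scores the smaller of its two counts (in the sample original document and in the sample index set) and that the highest-scoring category is selected; the function name is hypothetical, and the direct calculation stands in for the trained model.

```python
def index_related_category(doc_vec, index_vec):
    """Pick the category whose count is high in BOTH vectors (assumed rule)."""
    doc, idx = dict(doc_vec), dict(index_vec)
    # Score each category by the smaller of its two counts.
    scores = {c: min(doc.get(c, 0), idx.get(c, 0)) for c in set(doc) | set(idx)}
    if not scores:
        return None
    best = max(sorted(scores), key=scores.get)  # sorted() makes ties deterministic
    return best if scores[best] > 0 else None


doc_vec = [("concept phrase", 4), ("number phrase", 3)]
index_vec = [("concept phrase", 3), ("number phrase", 0)]
category = index_related_category(doc_vec, index_vec)
print(category)
```

Taking the minimum means a category that is frequent in the document but absent from the index set, such as the number phrases here, cannot be selected.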
S105: and converting the text of the original document of the natural language into an index set according to the semantic distribution rule mode of the text.
In step S103, the similarity between the text feature vector of the current natural language original document text and those of the sample original documents in the sample library is obtained, and the sample original document that best matches the current text is determined. In step S104, the phrase category related to the index set is determined according to the consistency between that sample original document and its sample index set. Phrases of the same category in the current natural language original document can therefore be selected, following the same semantic distribution rule mode, as the index set of the current original document, so that the text of the natural language original document is converted into the index set.
The text information rule analysis method for the search engine of this embodiment extracts features from the text of the natural language original document and matches the resulting text feature vector with the samples in the sample library to obtain a target sample. It then uses a pre-trained semantic distribution rule mode determination model to determine the semantic distribution rule mode of the text according to the text feature consistency between the sample original document of the target sample and the corresponding target sample index set, and converts the text of the natural language original document into the index set according to that mode. Through machine learning on the samples, the method addresses the extraction of an index set from natural language original text, particularly short text without an index: it finds the distribution rule of the words bearing the core semantics of the natural language original document and extracts search engine index items with high accuracy.
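A minimal sketch of the conversion step, assuming the phrases of the current document have already been classified and that the semantic distribution rule mode reduces to a single phrase category; the function name and the input layout are hypothetical.

```python
def to_index_set(classified_phrases, related_category):
    """Keep phrases of the related category, in order of first occurrence."""
    index_set = []
    for phrase, category in classified_phrases:
        if category == related_category and phrase not in index_set:
            index_set.append(phrase)
    return index_set


# Classified phrases of the highway example (phrase, category).
classified = [
    ("120 km/h", "quantity phrase"),
    ("60 km/h", "quantity phrase"),
    ("small passenger car", "concept phrase"),
    ("120 km/h", "quantity phrase"),
    ("other motor vehicles", "concept phrase"),
    ("100 km/h", "quantity phrase"),
    ("motorcycle", "concept phrase"),
    ("80 km/h", "quantity phrase"),
]
index_set = to_index_set(classified, "concept phrase")
print(index_set)
```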
Fig. 2 is a flowchart of a text information rule analysis method for a search engine according to a second embodiment of the present application. As a specific embodiment of the present application, the text information rule analysis method for a search engine includes the following steps:
S201: text of an original document in natural language is obtained.
In this embodiment, the text of the natural language original document may be text of a natural language original document that is searched for and assembled in a large scale by a search engine from raw data of a web page, an electronic book, a paper, and the like. Please refer to the first embodiment specifically, which is not described herein again.
S202: the method comprises the steps of segmenting the text into words, segmenting the text into a plurality of word groups, carrying out semantic recognition on each word group, determining the attribute category of each word group, and classifying the word groups of the same attribute category.
After the text is segmented into words, the text can be segmented into a plurality of phrases, each phrase is semantically identified according to the word meaning of each phrase, the attribute category of each phrase is determined, and the phrases with the same attribute category are classified. Specifically, a phrase attribute classification table may be constructed, where the phrase attribute classification table includes a phrase attribute category and a phrase semantic corresponding to the category, and performs semantic recognition on each phrase to determine the phrase attribute category of the phrase.
S203: Count the word frequency of the phrases in each phrase attribute category, and generate a text feature vector from the phrase attribute categories and the word frequency of each category.
S204: Using a pre-trained vector matching model, match the text against samples in a sample library according to the text feature vector to obtain a target sample, where each sample comprises a sample index set and the sample original document corresponding to that sample index set.
S205: Using a pre-trained semantic distribution rule pattern determination model, determine the semantic distribution rule pattern of the text according to the text feature consistency between the target sample original document and the corresponding target sample index set.
S206: Convert the text of the natural language original document into an index set according to the semantic distribution rule pattern of the text.
This embodiment can achieve technical effects similar to those of the above embodiments, which are not described here again.
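As a non-limiting illustration, steps S202 and S203 can be sketched as follows. The sketch assumes a small hand-written phrase attribute classification table, a toy stop-word list, and a naive whitespace segmenter; the names `PHRASE_CATEGORIES`, `STOP_WORDS`, `segment`, and `text_feature_vector` are hypothetical and do not come from the application, which does not specify a concrete segmenter or category set.

```python
from collections import Counter

# Hypothetical phrase attribute classification table: phrase -> attribute category.
# The application does not enumerate categories; these entries are illustrative only.
PHRASE_CATEGORIES = {
    "search engine": "technology",
    "index": "technology",
    "neural network": "technology",
    "patent": "legal",
    "claim": "legal",
}

# Hypothetical stop-word list used for filtering and denoising.
STOP_WORDS = {"the", "a", "an", "of", "and", "is"}


def segment(text):
    """Toy segmenter: prefer a known two-word phrase, otherwise emit a single word.

    A real system would use a proper word segmenter (assumption)."""
    words = text.lower().replace(".", " ").replace(",", " ").split()
    phrases = []
    i = 0
    while i < len(words):
        if i + 1 < len(words) and f"{words[i]} {words[i + 1]}" in PHRASE_CATEGORIES:
            phrases.append(f"{words[i]} {words[i + 1]}")
            i += 2
        else:
            phrases.append(words[i])
            i += 1
    return phrases


def text_feature_vector(text):
    """Return the vector {(S1, N1), ...} as a dict mapping category Si to frequency Ni."""
    counts = Counter()
    for phrase in segment(text):
        if phrase in STOP_WORDS:
            continue  # stop-word removal
        category = PHRASE_CATEGORIES.get(phrase)
        if category is not None:  # phrases absent from the table are treated as noise
            counts[category] += 1
    return dict(counts)


vector = text_feature_vector("The search engine builds an index of the patent claim.")
```

Representing the vector as a mapping from category to frequency keeps the (Si, Ni) pairing explicit and makes vectors with different category sets easy to compare.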
Fig. 3 is a schematic structural diagram of a text information rule analysis system for a search engine according to a third embodiment of the present application. The system for analyzing the text information rule for the search engine provided by the embodiment comprises:
a text obtaining module 301, configured to obtain the text of a natural language original document;
a text feature vector generation module 302, configured to extract features of the text to generate a text feature vector;
a vector matching module 303, configured to match the text with samples in a sample library according to the text feature vector to obtain a target sample, where each sample comprises a sample index set and the sample original document corresponding to that sample index set;
a semantic distribution rule pattern determining module 304, configured to determine the semantic distribution rule pattern of the text according to the text feature consistency between the target sample original document and the corresponding target sample index set;
an index set generating module 305, configured to convert the text of the natural language original document into an index set according to the semantic distribution rule pattern of the text.
Further, the text feature vector generation module 302 is specifically configured to:
extract phrases in the text, classify the phrases by attribute, count the word frequency of the phrases in each attribute category, and generate a text feature vector from the phrase attribute categories and the word frequency of each category.
The semantic distribution rule pattern determining module 304 is specifically configured to:
calculate the text feature vectors of the target sample original document and of the corresponding target sample index set, and determine the semantic distribution rule pattern according to the consistency of the word frequencies of same-category phrases in those two text feature vectors.
The text information rule analysis system for a search engine of this embodiment can achieve technical effects similar to those of the method embodiments; the details are not repeated here.
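As a non-limiting illustration, the cooperation of modules 301 to 305 can be sketched as a simple pipeline. The class `RuleAnalysisSystem`, its field names, and the stub functions in the usage lines are hypothetical; the application only names the modules and their responsibilities, not their interfaces.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

FeatureVector = Dict[str, int]


@dataclass
class RuleAnalysisSystem:
    """Hypothetical wiring of the five modules; each field stands in for one module."""
    obtain_text: Callable[[str], str]                     # module 301
    make_vector: Callable[[str], FeatureVector]           # module 302
    match_sample: Callable[[FeatureVector], Any]          # module 303
    determine_pattern: Callable[[Any], Any]               # module 304
    generate_index_set: Callable[[str, Any], List[str]]   # module 305

    def run(self, source: str) -> List[str]:
        text = self.obtain_text(source)
        vector = self.make_vector(text)
        target_sample = self.match_sample(vector)
        pattern = self.determine_pattern(target_sample)
        return self.generate_index_set(text, pattern)


# Usage with trivial stand-in modules (assumptions, for illustration only).
system = RuleAnalysisSystem(
    obtain_text=lambda source: source.strip(),
    make_vector=lambda text: {"technology": text.count("index")},
    match_sample=lambda vector: {"categories": set(vector)},
    determine_pattern=lambda sample: sample["categories"],
    generate_index_set=lambda text, pattern: sorted(pattern),
)
result = system.run("  index of the index  ")
```

Passing each module in as a callable keeps the data flow of Fig. 3 visible in one place while leaving every module free to be replaced by a trained model.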
Fig. 4 is a schematic flowchart of generating an index set using the text information rule analysis system for a search engine according to a fourth embodiment of the present application. As shown in Fig. 4, when the system is used to generate an index set for a search engine, the text of a natural language original document is input. In this embodiment, the vector matching module is a pre-trained neural network model. When the text feature vector of the current natural language original document is input, the module calculates and outputs the standard deviation between that vector and the text feature vector of each sample original document in the sample library. When the standard deviation is smaller than a preset threshold, the match succeeds, and the successfully matched sample original document is taken as the target sample original document. Specifically, the neural network model may be trained in advance on the large number of sample original documents stored in the sample library to produce the vector matching module, which then matches the text feature vector of the input natural language original document against the text feature vectors of the sample original documents in the sample library.
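As a non-limiting illustration, the threshold test described above can be sketched without a neural network by computing the deviation directly. The application does not define how the standard deviation between two vectors is computed; this sketch assumes one possible reading, namely the root mean square of the per-category frequency differences. The names `deviation` and `match_samples`, the sample identifiers, and the threshold value are hypothetical.

```python
import math


def deviation(vec_a, vec_b):
    """Root mean square of per-category frequency differences (assumed reading)."""
    categories = set(vec_a) | set(vec_b)
    if not categories:
        return 0.0
    squared = [(vec_a.get(c, 0) - vec_b.get(c, 0)) ** 2 for c in categories]
    return math.sqrt(sum(squared) / len(categories))


def match_samples(query_vec, sample_library, threshold):
    """Return the ids of samples whose deviation from the query is below the threshold."""
    return [
        sample_id
        for sample_id, sample_vec in sample_library.items()
        if deviation(query_vec, sample_vec) < threshold
    ]


# Hypothetical sample library: sample id -> text feature vector of the sample original document.
library = {
    "s1": {"technology": 2, "legal": 1},
    "s2": {"sports": 5},
}
targets = match_samples({"technology": 2, "legal": 2}, library, threshold=1.0)
```

Here sample `s1` differs from the query in one category by one occurrence, so it falls under the threshold, while `s2` shares no category with the query and is rejected.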
Because the text feature vector records the phrase categories in the text and the number of phrases in each category, the vector matching module can match the text of the natural language original document against a sample original document based on the phrases each contains and their counts. After the sample original document corresponding to the text of the natural language original document is obtained, the semantic distribution rule pattern determining module determines the semantic distribution rule pattern of the text according to the text feature consistency between that sample original document and its corresponding sample index set. Specifically, the module compares the text feature vectors of the sample original document and of the corresponding sample index set, determines the consistency of the word frequencies of same-category phrases in the two vectors, and from this determines the semantic distribution rule pattern. The index set generation module then extracts phrases of the same categories from the text of the natural language original document according to the semantic distribution rule pattern of the text and converts them into an index set.
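As a non-limiting illustration, the determination of index-related phrase categories and the subsequent generation of an index set can be sketched as follows. The application uses a pre-trained model for this step and does not quantify what counts as a high word frequency; this sketch substitutes a plain rule with an assumed cutoff `min_freq`. The function names and the example data are hypothetical.

```python
def index_related_categories(doc_vec, index_vec, min_freq=2):
    """Categories whose word frequency reaches min_freq in both the sample
    original document vector and the sample index set vector (assumed rule)."""
    return {
        category
        for category, freq in doc_vec.items()
        if freq >= min_freq and index_vec.get(category, 0) >= min_freq
    }


def build_index_set(classified_phrases, categories):
    """Keep phrases of the current document whose category is index-related,
    removing duplicates while preserving first-seen order."""
    seen = set()
    index_set = []
    for phrase, category in classified_phrases:
        if category in categories and phrase not in seen:
            seen.add(phrase)
            index_set.append(phrase)
    return index_set


# Hypothetical vectors for a matched target sample.
sample_doc_vec = {"technology": 5, "legal": 1, "filler": 4}
sample_index_vec = {"technology": 3, "legal": 1}
pattern = index_related_categories(sample_doc_vec, sample_index_vec)

# Hypothetical (phrase, category) pairs extracted from the current original document.
current_phrases = [
    ("search engine", "technology"),
    ("patent", "legal"),
    ("index", "technology"),
    ("search engine", "technology"),
]
index_set = build_index_set(current_phrases, pattern)
```

In this example only the "technology" category is frequent in both the sample original document and its index set, so only phrases of that category are carried into the index set of the current document.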
The foregoing description is only exemplary of the preferred embodiments of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (4)

1. A text information rule analysis method for a search engine, characterized by comprising the following steps:
acquiring the text of a natural language original document;
extracting features of the text of the natural language original document to generate a text feature vector;
matching the text with samples in a sample library by using a pre-trained vector matching model according to the text feature vector to obtain a target sample, wherein each sample comprises a sample index set and the sample original document corresponding to the sample index set;
determining a semantic distribution rule pattern of the text by using a pre-trained semantic distribution rule pattern determination model according to the text feature consistency between the target sample original document and the corresponding target sample index set;
wherein, according to the word frequencies of same-category phrases in the text feature vectors of the target sample original document and of the corresponding target sample index set, the semantic distribution rule pattern determines the phrase categories that have a high word frequency in both the target sample original document and the corresponding target sample index set as the phrase categories related to the index set;
determining the phrase categories related to the index set according to the semantic distribution rule pattern of the text and the consistency between the sample original document and the sample index set, selecting phrases of the same categories in the current natural language original document under the same text semantic distribution rule pattern as the index set of the current original document, and thereby converting the text of the natural language original document into the index set;
wherein the extracting features of the text of the natural language original document to generate a text feature vector comprises: extracting phrases in the text, classifying the phrases by attribute, counting the word frequency of each category of phrases, and generating the text feature vector according to the phrase categories and the word frequency of each category of phrases;
wherein the extracting phrases in the text, classifying the phrases by attribute, and counting the word frequency of each category of phrases comprises: segmenting the text into a plurality of phrases, classifying each phrase, determining the attribute category of each phrase, and performing word frequency statistics on the phrases of each attribute category;
wherein the classifying each phrase and determining the attribute category of each phrase specifically comprises: establishing a phrase category index table, recording the common phrases corresponding to each category in the phrase category index table, and, by calling the phrase category index table, classifying the phrases that are extracted from the text of the natural language original document and retained after stop-word removal into the corresponding phrase categories of the index table;
generating the corresponding text feature vector for the text of the natural language original document by using the counted phrase categories and the word frequency of each category, wherein the text feature vector is represented as {(S1, N1), (S2, N2), …, (Sn, Nn)}, where S1, S2, …, Sn are the phrase categories and N1, N2, …, Nn are the word frequencies of the respective phrase categories;
wherein, after segmenting the text into a plurality of phrases and performing semantic recognition on each phrase, the method further comprises: performing stop-word removal, filtering and denoising on the semantically recognized phrases to filter out the noise phrases contained in them;
wherein the matching the text with samples in the sample library by using the pre-trained vector matching model according to the text feature vector comprises: pre-training a neural network model to generate the vector matching model, calculating, by using the vector matching model, the standard deviation between the text feature vector of the current natural language original document text and the text feature vector of the sample original document in the sample library, determining that the matching is successful when the standard deviation is less than a preset threshold, and taking the successfully matched sample original document as the target sample original document.
2. The method according to claim 1, wherein the determining a semantic distribution rule pattern of the text by using a pre-trained semantic distribution rule pattern determination model according to the text feature consistency between the target sample original document and the corresponding target sample index set comprises: calculating the text feature vectors of the target sample original document and of the corresponding target sample index set, and determining the semantic distribution rule pattern of the text according to the consistency of the word frequencies of same-category phrases in those text feature vectors.
3. A text information rule analysis system for a search engine, comprising:
a text acquisition module, configured to acquire the text of a natural language original document;
a text feature vector generation module, configured to extract features of the text of the natural language original document to generate a text feature vector;
a vector matching module, configured to match the text of the natural language original document with samples in a sample library according to the text feature vector to obtain a target sample;
a semantic distribution rule pattern determining module, configured to determine a semantic distribution rule pattern of the text according to the text feature consistency between the target sample original document and the corresponding target sample index set, wherein the semantic distribution rule pattern determining module is specifically configured to: calculate the text feature vectors of the target sample original document and of the corresponding target sample index set, and determine the semantic distribution rule pattern according to the consistency of the word frequencies of same-category phrases in those text feature vectors;
an index set generation module, configured to: convert the text of the natural language original document into an index set according to the semantic distribution rule pattern of the text; determine the phrase categories related to the index set according to the consistency between the sample original document and the sample index set, wherein, according to the word frequencies of same-category phrases in the text feature vectors of the target sample original document and of the target sample index set, the phrase categories that have a high word frequency in both the target sample original document and the corresponding target sample index set are determined as the phrase categories related to the index set; and select phrases of the same categories in the current natural language original document under the same text semantic distribution rule pattern as the index set of the current original document;
wherein the text feature vector generation module is configured to: extract phrases in the text, classify the phrases by attribute, count the word frequency of the phrases of each attribute category, and generate the text feature vector according to the phrase attribute categories and the word frequency of each category of phrases;
wherein the extracting phrases in the text, classifying the phrases by attribute, and counting the word frequency of each category of phrases comprises: performing word segmentation on the text to divide it into a plurality of phrases, classifying each phrase, determining the attribute category of each phrase, and performing word frequency statistics on the phrases of each attribute category;
wherein the classifying each phrase and determining the attribute category of each phrase specifically comprises: establishing a phrase category index table, recording the common phrases corresponding to each category in the phrase category index table, and, by calling the phrase category index table, classifying the phrases that are extracted from the text of the natural language original document and retained after stop-word removal into the corresponding phrase categories of the index table;
generating the corresponding text feature vector for the text of the natural language original document by using the counted phrase categories and the word frequency of each category, wherein the text feature vector is represented as {(S1, N1), (S2, N2), …, (Sn, Nn)}, where S1, S2, …, Sn are the phrase categories and N1, N2, …, Nn are the word frequencies of the respective phrase categories;
wherein, after the word segmentation is performed on the text, the text is divided into a plurality of phrases, and semantic recognition is performed on each phrase, the following is further performed: stop-word removal, filtering and denoising on the semantically recognized phrases to filter out the noise phrases contained in them;
wherein the matching the text with samples in the sample library by using a pre-trained vector matching model according to the text feature vector comprises: pre-training a neural network model to generate the vector matching model, calculating, by using the vector matching model, the standard deviation between the text feature vector of the current natural language original document text and the text feature vector of the sample original document in the sample library, determining that the matching is successful when the standard deviation is less than a preset threshold, and taking the successfully matched sample original document as the target sample original document.
4. The system of claim 3, wherein the semantic distribution rule pattern determining module is specifically configured to: calculate the text feature vectors of the target sample original document and of the corresponding target sample index set, and determine the semantic distribution rule pattern according to the consistency of the word frequencies of same-category phrases in those text feature vectors.
CN201811062638.2A 2018-09-12 2018-09-12 Text information rule analysis method and system for search engine Active CN109033478B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811062638.2A CN109033478B (en) 2018-09-12 2018-09-12 Text information rule analysis method and system for search engine

Publications (2)

Publication Number Publication Date
CN109033478A CN109033478A (en) 2018-12-18
CN109033478B CN109033478B (en) 2022-08-19

Family

ID=64621773

Country Status (1)

Country Link
CN (1) CN109033478B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant