WO2017091985A1 - Stop word identification method and apparatus - Google Patents

Stop word identification method and apparatus

Info

Publication number
WO2017091985A1
Authority
WO
WIPO (PCT)
Prior art keywords
query statement
word
query
feature
statement
Prior art date
Application number
PCT/CN2015/096179
Other languages
English (en)
French (fr)
Inventor
周文礼
王喆
胡斐然
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to CN201580029727.5A priority Critical patent/CN108027814B/zh
Priority to PCT/CN2015/096179 priority patent/WO2017091985A1/zh
Priority to JP2017521535A priority patent/JP6355840B2/ja
Priority to EP15909502.5A priority patent/EP3232336A4/en
Publication of WO2017091985A1 publication Critical patent/WO2017091985A1/zh
Priority to US15/693,971 priority patent/US10019492B2/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24553 Query execution of query operations

Definitions

  • The present invention relates to the field of computer technologies, and in particular, to a stop word identification method and apparatus used in an information retrieval system, and a computing device.
  • An information retrieval system, such as a search engine or a question answering system, retrieves the content required by a user according to a query statement input by the user.
  • The query statement input by the user may contain words that occur frequently but carry no practical meaning. Such words are called stop words.
  • The information retrieval system needs to identify the stop words in the query statement and remove them to obtain the keywords of the query statement. The system then performs matching according to the obtained keywords to obtain the content required by the user.
  • Stop word recognition generally relies on a stop word list edited manually in advance by vocabulary experts. Manually editing the stop word list is costly, and identifying stop words in an input statement by matching against the list cannot adapt to increasingly complex user search behavior.
  • The application provides a stop word identification method, apparatus, and computing device to improve the recognition accuracy of stop words.
  • A first aspect of the present application provides a stop word identification method, executed by an information retrieval system running on a computing device, comprising: receiving a first query statement, and acquiring a session identifier (ID) corresponding to the first query statement; acquiring, according to the session ID, a second query statement that belongs to the same session as the first query statement; acquiring a change feature of each word of the first query statement relative to the second query statement, where the change feature represents the changes of each word of the first query statement relative to the second query statement, such as whether the word is newly added, its part of speech, its position, and the punctuation on either side of it; and identifying the stop words in the first query statement according to the change features of the words of the first query statement relative to the second query statement.
  • The second query statement is the query statement that the user input immediately before the first query statement. When a user searches through the information retrieval system, the change features between consecutively input query statements reflect how the user adjusts the query, so they facilitate the identification of stop words.
  • By acquiring a query statement that belongs to the same session as the to-be-processed query statement, obtaining the change features of the words of the to-be-processed query statement relative to that query statement, and taking the change features into account when identifying the stop words in the to-be-processed query statement, stop words can be identified according to the change features between query statements, which improves the recognition accuracy of stop words.
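  • As a rough illustration of the change features described above, the following Python sketch compares each word of a to-be-processed query statement with an earlier query statement of the same session. The feature names (`is_new`, `position_shift`, `quote_changed`) and the simple tokenizer are assumptions made for this example; the application does not fix a concrete feature encoding.

```python
import re


def tokenize(query):
    """Split a bordered-language query into (word, quoted) pairs.

    Hypothetical helper: a word counts as quoted when it lies between a
    pair of single or double quotation marks.
    """
    tokens = []
    quoted = False
    for piece in re.findall(r"['\"]|[^\s'\"]+", query):
        if piece in ("'", '"'):
            quoted = not quoted  # entering or leaving a quoted span
        else:
            tokens.append((piece.lower(), quoted))
    return tokens


def change_features(first_query, second_query):
    """Describe how each word of first_query changed relative to second_query."""
    first = tokenize(first_query)
    second = tokenize(second_query)
    second_pos, second_quoted = {}, {}
    for i, (word, quoted) in enumerate(second):
        second_pos.setdefault(word, i)        # keep the first occurrence
        second_quoted.setdefault(word, quoted)
    features = []
    for i, (word, quoted) in enumerate(first):
        is_new = word not in second_pos
        features.append({
            "word": word,
            "is_new": is_new,
            "position_shift": None if is_new else i - second_pos[word],
            "quote_changed": (not is_new) and quoted != second_quoted[word],
        })
    return features


features = change_features("backstreet boys 'the one'", "the one backstreet boys")
```

  • In this sketch, a word that the user newly wraps in quotation marks is flagged through `quote_changed`, which is one plausible way to capture the "punctuation at both ends of the word" change mentioned in the text.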
  • The obtained second query statement meets any one, or a combination of any two or more, of the following conditions: the length of the longest common clause of the first query statement and the second query statement is greater than a first threshold; or the minimum number of operations required to convert the first query statement into the second query statement is less than a second threshold; or, when the first query statement is mapped to a first vector and the second query statement is mapped to a second vector, the angle between the first vector and the second vector is less than a third threshold; or the ratio of the length of the longest common clause of the first query statement and the second query statement to the sum of the lengths of the first query statement and the second query statement is greater than a fourth threshold; or the ratio of the length of the longest common clause of the first query statement and the second query statement to the length of the shorter of the two query statements is greater than a fifth threshold; or the distance between the first vector and the second vector is less than a sixth threshold.
  • While using the information retrieval system, the user may change the retrieval target. The query statements a user enters for different targets generally differ greatly, whereas the change features between two query statements aimed at the same or a similar retrieval target are more useful for identifying stop words. Therefore, among the multiple query statements that belong to the same session as the first query statement, a second query statement is selected that is more likely to have the same or a similar retrieval target as the first query statement, and it is used to extract the change features of each word in the first query statement relative to the second query statement.
  • The method further includes: querying the word feature database of the information retrieval system according to each word of the first query statement, to obtain the statistical features of each word of the first query statement. Accordingly, the stop words in the first query statement are identified according to both the change features of each word of the first query statement relative to the second query statement and the statistical features of each word of the first query statement.
  • The statistical features of each word of the first query statement reflect statistical parameters of the word in the file library. Using them in stop word recognition can further improve the recognition accuracy.
  • The statement feature of each word of the first query statement within the first query statement is also obtained, and the stop words in the first query statement are identified according to the statistical features of each word, the change features of each word relative to the second query statement, and the statement features of each word within the first query statement, to further improve the recognition accuracy of stop words.
  • Identifying the stop words in the first query statement according to the statistical features of each word of the first query statement and the change features of each word relative to the second query statement comprises: inputting the change features of each word of the first query statement relative to the second query statement and the statistical features of each word of the first query statement into a recognition model, and obtaining the stop words in the first query statement identified by the recognition model. The recognition model is generally a piece of program code that implements the stop word recognition function.
  • The method further includes: using the statistical features of the stop words in the first query statement and the change features of those stop words relative to the second query statement as positive samples, using the statistical features of the words other than the stop words in the first query statement and the change features of those words relative to the second query statement as negative samples, and training the recognition model according to the positive samples and the negative samples.
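  • The sample construction described above can be sketched as follows. This is only an illustration: the two-element feature vectors (a normalized word frequency and an `is_new` flag), the example words, and the use of a perceptron as the recognition model are all assumptions, since the text does not name a specific model or feature layout.

```python
def build_samples(word_features, stop_words):
    """Label each word's feature vector: 1 (positive) for stop words, 0 otherwise."""
    return [(vector, 1 if word in stop_words else 0)
            for word, vector in word_features.items()]


def train_perceptron(samples, epochs=20, learning_rate=0.1):
    """Train a plain perceptron as a stand-in recognition model."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for vector, label in samples:
            activation = sum(w * x for w, x in zip(weights, vector)) + bias
            error = label - (1 if activation > 0 else 0)
            if error:
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, vector)]
                bias += learning_rate * error
    return weights, bias


def predict(model, vector):
    """Return 1 if the model classifies the word as a stop word, else 0."""
    weights, bias = model
    return 1 if sum(w * x for w, x in zip(weights, vector)) + bias > 0 else 0


# Made-up feature vectors: [normalized word frequency, is_new flag].
word_features = {
    "a": [0.9, 1.0],
    "photo": [0.2, 0.0],
    "of": [0.8, 1.0],
    "kobe": [0.1, 0.0],
}
samples = build_samples(word_features, stop_words={"a", "of"})
model = train_perceptron(samples)
```

  • Any classifier that accepts labeled feature vectors could take the place of the perceptron here; the point of the sketch is only the split into positive and negative samples.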
  • The stop words identified by the recognition model are removed from the first query statement to obtain candidate search terms, and a search is performed according to the candidate search terms to obtain a retrieval result. The training is performed if the retrieval result is determined to be correct.
  • Determining the correctness of the retrieval result means analyzing the operation information corresponding to the first query statement to determine how satisfied the user is with the retrieval result corresponding to the first query statement. The stop words and non-stop words identified for query statements whose results satisfied the user, together with their features, are selected to train the recognition model, further improving its recognition accuracy.
  • A second aspect of the present application provides a stop word identification apparatus comprising an input module and a processing module. The input module is configured to receive the first query statement and obtain a session ID corresponding to the first query statement.
  • The processing module is configured to: acquire, according to the session ID, a second query statement that belongs to the same session as the first query statement; obtain a change feature of each word of the first query statement relative to the second query statement, where the change feature reflects the changes of each word of the first query statement relative to the second query statement, such as its part of speech, its position, and the punctuation on either side of it; and identify the stop words in the first query statement according to the change features of each word of the first query statement relative to the second query statement.
  • The apparatus is used to implement the stop word identification method provided by the first aspect.
  • a third aspect of the present application provides a computing device including a processor and a memory.
  • the computing device can implement the stop word recognition method provided by the first aspect, and the program code for implementing the stop word recognition method provided by the first aspect can be saved in a memory and executed by the processor.
  • a fourth aspect of the present application provides a storage medium capable of implementing the stop word recognition method provided by the first aspect when the program code stored in the storage medium is executed.
  • the program code is comprised of computer instructions that implement the stop word recognition method provided by the first aspect.
  • FIG. 1 is a schematic structural diagram of an information retrieval system according to an embodiment of the present invention.
  • FIG. 2 is a schematic structural diagram of another information retrieval system according to an embodiment of the present invention.
  • FIG. 3 is a schematic structural diagram of a computing device according to an embodiment of the present invention.
  • FIG. 4 is a schematic flowchart of a method for identifying a stop word according to an embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of a stop word recognition apparatus according to an embodiment of the present invention.
  • A stop word is a word that has no direct effect, or only a minor effect, on what a statement expresses.
  • In an information retrieval system, a stop word is a word in a query statement input by a user that does not help in searching for relevant files, such as "one" in the query "one basketball player Kobe".
  • Here "one" does not help to retrieve the content that the user wants, so "one" can be regarded as a stop word in this scenario.
  • In different scenarios, the same word may or may not be judged to be a stop word. For example, in the query "one world one dream", removing "one" as a stop word would greatly harm the accuracy of the search results.
  • The term "session" refers to the interaction between two or more devices over a period of time. If a session is established between a user and a server, the period begins when the user starts using the service and ends when the user explicitly stops using the service or has not interacted with the server for a certain period of time, for example, 30 minutes. Specifically, in an information retrieval system, when a new session starts, the system generates a new session ID and continues to receive query statements sent by the user.
  • When the information retrieval system has not received a new query statement from the user for a certain period of time, it considers the current session to have ended. All query statements received between the beginning and the end of the session belong to the session, and the session ID and the query statements belonging to the session are stored in the historical query statements.
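  • A minimal sketch of the session handling described above, assuming an in-memory store and a per-user idle timeout. The class name `SessionTracker`, its methods, and the integer session IDs are hypothetical; only the 30-minute example timeout comes from the text.

```python
import itertools

SESSION_TIMEOUT_SECONDS = 30 * 60  # the text gives 30 minutes as an example


class SessionTracker:
    """Assigns session IDs to incoming query statements (illustrative only)."""

    def __init__(self, timeout=SESSION_TIMEOUT_SECONDS):
        self.timeout = timeout
        self._next_id = itertools.count(1)
        self._current = {}  # user -> (session_id, time of last query)
        self.history = {}   # session_id -> query statements of that session

    def record(self, user, query, now):
        """Store the query and return the ID of the session it belongs to."""
        session = self._current.get(user)
        if session is None or now - session[1] > self.timeout:
            # No live session for this user: start a new one.
            session_id = next(self._next_id)
            self.history[session_id] = []
        else:
            session_id = session[0]
        self._current[user] = (session_id, now)
        self.history[session_id].append(query)
        return session_id


tracker = SessionTracker()
first_id = tracker.record("user-1", "the one backstreet boys", now=0)
second_id = tracker.record("user-1", "backstreet boys 'the one'", now=60)
third_id = tracker.record("user-1", "basketball player kobe", now=60 + 1801)
```

  • The `history` mapping plays the role of the historical query statements: given a session ID, it returns the earlier query statements of the same session.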
  • A borderless language is a language in which words are not separated by punctuation or spaces.
  • Common borderless languages include Chinese and Japanese.
  • Common bordered languages include English.
  • FIG. 1 shows an implementation of an information retrieval system 200, which includes a storage device 206 and a retrieval device 202.
  • The storage device 206 stores the data that the information retrieval system needs for retrieval. The storage device 206 can communicate with the retrieval device 202 through the communication network 204, or it can be disposed directly in the retrieval device 202 and communicate with it through the input/output unit 2021.
  • The retrieval device 202 includes an input/output unit 2021 and a processing unit 2022. After the user sends a query statement to the retrieval device 202 through the input/output unit 2021, the retrieval device 202 searches according to the query statement and returns the corresponding retrieval result to the user. Generally, the retrieval result of an information retrieval system is presented to the user as a series of documents.
  • The input/output unit 2021 may be a network interface. If the user sends the query statement locally at the retrieval device 202, the input/output unit 2021 may instead be an input/output (I/O) interface of the retrieval device 202.
  • FIG. 2 shows another implementation of the information retrieval system 200, which includes one or more retrieval devices 202 and one or more storage devices 206. Each retrieval device 202 communicates with each storage device 206 through a communication network.
  • The file library, index file library, historical query statements, historical query log, word feature database, and the like of the information retrieval system 200 may be distributed across the storage devices 206.
  • The one or more retrieval devices 202 can form a distributed computing system to process query statements.
  • When the number of query statements to be processed is large and the load of the information retrieval system 200 is high, the system can allocate the pending tasks to different retrieval devices 202 to make use of its parallel processing capability.
  • The information retrieval system 200 generally updates, at regular intervals, the files that it can index and stores them in the file library. After obtaining the updated files, the system assigns an ID to each file and builds an index. A common index is the inverted index. As shown in Table 1, the inverted index records, for each word, the IDs of the files in which the word appears. The file that stores the index is also called the index file.
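  • To make the inverted index concrete, the sketch below builds one from a tiny file library. The function name, the whitespace tokenization, and the sample files are assumptions for illustration; the structure (word mapped to the IDs of the files containing it) follows the description above.

```python
from collections import defaultdict


def build_inverted_index(files):
    """Map each word to the sorted IDs of the files that contain it.

    files: dict of file ID -> file text (whitespace-separated words).
    """
    index = defaultdict(set)
    for file_id, text in files.items():
        for word in text.lower().split():
            index[word].add(file_id)
    return {word: sorted(ids) for word, ids in index.items()}


# A made-up three-file library.
files = {
    1: "backstreet boys the one",
    2: "one world one dream",
    3: "basketball player kobe",
}
index = build_inverted_index(files)
```

  • Looking up a word such as `index["one"]` then yields the IDs of every file that contains it, which is the operation the matching step relies on.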
  • The processing unit 2022 divides the query statement into a series of words. If the query statement is in a borderless language, this process is also called word segmentation. For example, the Chinese phrase for "mobile shopping" is segmented into the word for "mobile phone" and the word for "shopping". If the query statement is in English, no segmentation is needed, and the words are separated directly by the spaces in the query statement. Some of the resulting words may be stop words, so the stop words among them are identified to ensure the accuracy of the retrieval result.
  • After the stop words are removed, the remaining words are matched against the index file to obtain the degree of match of each matching file, expressed as a score or a rank. Finally, a certain number of files with the highest scores or ranks are returned to the user.
  • The accuracy of the retrieval result output by the information retrieval system 200 depends largely on the accuracy of the words matched against the index file, so accurate identification of stop words is important to the performance of an information retrieval system.
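  • The effect of stop word removal on matching can be sketched as follows. The scoring rule (one point per matched query word), the `top_k` parameter, and the sample index are assumptions for this example, not the scoring used by the system in the text.

```python
from collections import Counter


def search(query, index, stop_words, top_k=3):
    """Rank files by how many non-stop query words they contain."""
    words = [w for w in query.lower().split() if w not in stop_words]
    scores = Counter()
    for word in words:
        for file_id in index.get(word, ()):
            scores[file_id] += 1
    # Highest score first; ties broken by file ID for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [file_id for file_id, _ in ranked[:top_k]]


# A made-up inverted index: word -> IDs of the files containing it.
index = {
    "one": [1, 2],
    "basketball": [3],
    "player": [3],
    "kobe": [3, 4],
    "world": [2],
    "dream": [2],
}
with_removal = search("one basketball player kobe", index, stop_words={"one"})
without_removal = search("one basketball player kobe", index, stop_words=set())
```

  • With "one" treated as a stop word, only files about the player are returned; without removal, files that merely contain "one" enter the result list.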
  • the retrieval device 202 of Figure 1 or Figure 2 can be implemented by the computing device 400 of Figure 3.
  • the schematic diagram of the organization of the computing device 400 is as shown in FIG. 3, and includes a processor 402 and a memory 404.
  • The computing device 400 may further include a bus 408 and a communication interface 406.
  • The communication interface 406 can be an implementation of the input/output unit 2021, and the processor 402 and the memory 404 can be an implementation of the processing unit 2022.
  • the processor 402, the memory 404, and the communication interface 406 can implement communication connection with each other through the bus 408, and can also implement communication by other means such as wireless transmission.
  • The memory 404 may include volatile memory, such as random-access memory (RAM); it may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory 404 may also include a combination of the above types of memory.
  • The memory 404 loads data from the storage device 206, such as the historical query statements, the historical query log, and the word feature database, for use by the processor 402.
  • the program code for implementing the stop word recognition method provided by FIG. 4 of the present invention may be stored in the memory 404 and executed by the processor 402.
  • the computing device 400 obtains the query statement through the communication interface 406, and returns the search result corresponding to the query statement to the user through the communication interface 406.
  • The processor 402 can be a central processing unit (CPU). After acquiring the first query statement, the processor 402 obtains a second query statement that belongs to the same session as the first query statement, and extracts the change feature of each word in the first query statement relative to the second query statement.
  • The change feature indicates the changes of each word in the first query statement relative to the second query statement, such as changes in position, part of speech, adjacent punctuation, and grammatical category. The processor 402 identifies the stop words in the first query statement according to these change features.
  • By acquiring a query statement that belongs to the same session as the to-be-processed query statement, obtaining the change features of the words of the to-be-processed query statement relative to that query statement, and taking the change features into account when identifying the stop words in the to-be-processed query statement, stop words can be identified according to the change features between query statements, which improves the recognition accuracy of stop words.
  • The processor 402 may acquire multiple query statements that belong to the same session as the first query statement. While using the information retrieval system, the user may change the retrieval target, and the query statements used for different targets generally differ greatly, whereas the change features between two query statements aimed at the same or a similar retrieval target are more useful for identifying stop words. The processor 402 therefore further screens the query statements that belong to the same session as the first query statement to determine a second query statement that differs less from the first query statement, and then uses the second query statement to extract the change features of each word in the first query statement relative to the second query statement.
  • the processor 402 can also acquire the statistical features of each word of the first query statement, and input the statistical features and the changing features of each word into the recognition model to identify the stop words in the first query sentence.
  • The recognition model used by the processor 402 can be a piece of program code stored in the memory 404, which the processor 402 invokes when it trains the recognition model or uses the recognition model to identify stop words.
  • The recognition model can also be implemented in hardware: the processor 402 inputs the statistical features and change features of the words of the first query statement to dedicated hardware, which returns the recognition result to the processor 402. The hardware can be a field-programmable gate array (FPGA).
  • The statistical features of each word in the first query statement reflect statistical information about the word in the file library of the information retrieval system. They are added to the stop word identification process to help identify the stop words in the first query statement.
  • the present invention also provides a method for identifying a stop word.
  • the retrieval device 202 of FIG. 1 and FIG. 2 and the computing device 400 of FIG. 3 execute the stop word recognition method during operation, and a schematic flowchart thereof is shown in FIG. 4 .
  • Step 602: Receive a first query statement, and obtain a session ID corresponding to the first query statement.
  • For example, the first query statement received by the information retrieval system in this embodiment is "backstreet boys 'the one'", and the session ID corresponding to "backstreet boys 'the one'" is obtained.
  • If the query statement is the first query statement of a new session, a session ID needs to be generated for it in step 602.
  • If the query statement belongs to an existing session, the session ID obtained in step 602 is the existing session ID.
  • Step 604 is further performed after step 602.
  • Step 604: Query the word feature database of the information retrieval system according to each word of the first query statement, and obtain the statistical features of each word of the first query statement.
  • The words "backstreet", "boys", "the", and "one" are obtained from "backstreet boys 'the one'". If the query statement received in step 602 is in a borderless language, word segmentation needs to be performed to obtain each word. The statistical features of the four words, such as word frequency, word frequency mean, and word frequency variance, are then obtained.
  • The word feature database is built by the information retrieval system by collecting statistics on the features of each word appearing in a certain number of files, or in the files obtained within a certain period of time. Once the words contained in the query statement are obtained, each can be looked up in the word feature database to get its statistical feature values.
  • ..., statistical feature n}, {', 6}}, where {backstreet, 1, statistical feature 1, statistical feature 2, ..., statistical feature n} indicates that the first word of the query statement is "backstreet", and statistical feature 1 to statistical feature n are the various statistical features of the word "backstreet".
  • the statistical characteristics of each word in the first query statement reflect the statistical information of each word in the first query sentence in the file library of the information retrieval system, and analyzing the statistical characteristics of each word also helps to identify the stop words.
  • the information retrieval system After receiving the first query statement, the information retrieval system processes the first query sentence and converts it into a data structure, such as Query1[6][n+2] above, in addition to obtaining the statistical features of each word.
  • the statement feature of each word in the first query statement in the first query statement is also obtained, including the number of occurrences of each word in the first query statement, the part of each word, and the word of each word in the first query sentence.
  • the grammatical category the position of each word in the first query, whether the words are spaces before and after each word, whether each word is enclosed in quotation marks, etc. If there are m kinds of sentence features, then after obtaining "backstreet boys'the one'" Will be converted to Query1[6][n+m+2].
  • the statement features of the first query statement of each word of the first query statement reflect the characteristics of each word in the first query statement, and analyze each The statement characteristics of the word also help identify stop words.
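  • One way to picture the Query1 structure is the sketch below, which builds one row per token: the token, its position, n statistical features, and m statement features. The contents of `WORD_FEATURE_DB` are invented values, n = 3 and m = 2 are arbitrary choices, and the two statement features (occurrence count and a quoted flag) are only a subset of those listed in the text.

```python
import re

# Hypothetical word feature database:
# word -> (word frequency, word frequency mean, word frequency variance).
WORD_FEATURE_DB = {
    "backstreet": (120, 0.4, 0.10),
    "boys": (900, 1.3, 0.50),
    "the": (99000, 25.0, 4.00),
    "one": (45000, 9.0, 2.50),
}


def build_query_rows(query, feature_db):
    """Return rows [token, position, *statistical, *statement] for each token."""
    tokens = re.findall(r"['\"]|[^\s'\"]+", query)
    rows = []
    in_quotes = False
    for position, token in enumerate(tokens, start=1):
        if token in ("'", '"'):
            in_quotes = not in_quotes
            rows.append([token, position])  # punctuation rows carry no features
            continue
        statistical = list(feature_db.get(token.lower(), (0, 0.0, 0.0)))
        statement = [tokens.count(token), int(in_quotes)]
        rows.append([token, position] + statistical + statement)
    return rows


query1 = build_query_rows("backstreet boys 'the one'", WORD_FEATURE_DB)
```

  • With n = 3 and m = 2, each word row has n + m + 2 = 7 entries, matching the Query1[6][n+m+2] shape, while the two quotation-mark rows hold only the token and its position.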
  • Step 606: Acquire a second query statement that belongs to the same session as the first query statement according to the session ID corresponding to the first query statement.
  • the query statement obtained in step 606 may be the previous query statement before the query statement in step 602.
  • the historical query statement records the query statement to which the session corresponding to the session ID belongs, and the historical query statement may further include information of the query statement belonging to the same statement chain.
  • Steps 604 and 606 may be performed in either order or in parallel.
  • The statistical features and statement features obtained in step 604 and the change features obtained in steps 606, 608, and 610 are all used in step 612, so step 604 can be processed in parallel with step 606, which further includes steps 608 and 610.
  • Step 608 is further performed after step 606.
  • Step 608: Determine that the obtained second query statement meets any one of the following conditions: the length of the longest common clause of the first query statement and the second query statement is greater than the first threshold; or the minimum number of operations required to convert the first query statement into the second query statement is less than the second threshold; or, when the first query statement is mapped to a first vector and the second query statement is mapped to a second vector, the angle between the first vector and the second vector is less than the third threshold; or the ratio of the length of the longest common clause of the first query statement and the second query statement to the sum of the lengths of the first query statement and the second query statement is greater than the fourth threshold; or the ratio of the length of the longest common clause of the first query statement and the second query statement to the length of the shorter of the two query statements is greater than the fifth threshold; or the distance between the first vector and the second vector is less than the sixth threshold.
  • The session to which the first query statement belongs may include multiple query statements. Therefore, in step 608 a second query statement that satisfies the conditions is screened out from these query statements, that is, a query statement that can form a statement chain with the first query statement.
  • The screening criterion can be any one, or any combination, of the following conditions. Determining that the second query statement and the first query statement meet the criterion means determining that they can form a statement chain. If a statement chain can be formed, step 610 is performed next; otherwise, step 612 is performed.
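  • Conditions one and two of step 608 can be sketched as below. The sketch assumes that the "longest common clause" is the longest run of consecutive words shared by both query statements and that the "minimum number of operations" is a word-level edit distance (insert, delete, substitute); the text does not spell out either definition, and the threshold values are arbitrary.

```python
import re


def words(query):
    """Lower-cased words of a query, ignoring quotation marks."""
    return re.findall(r"[^\s'\"]+", query.lower())


def longest_common_clause(a, b):
    """Length, in words, of the longest run of consecutive words shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def edit_distance(a, b):
    """Minimum number of word insertions, deletions, or substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        curr = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        prev = curr
    return prev[len(b)]


def forms_statement_chain(first, second, first_threshold=1, second_threshold=3):
    """Condition one OR condition two, with assumed threshold values."""
    a, b = words(first), words(second)
    return (longest_common_clause(a, b) > first_threshold
            or edit_distance(a, b) < second_threshold)


first = "backstreet boys 'the one'"
second = "the one backstreet boys"
```

  • For the two example query statements, the shared run "backstreet boys" (or "the one") gives a longest common clause of two words, so they pass condition one under the assumed first threshold.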
  • Condition three: convert "backstreet boys 'the one'" and "the one backstreet boys" into vectors and calculate the angle between the two vectors. If the angle is less than the third threshold, "backstreet boys 'the one'" and "the one backstreet boys" can form a statement chain.
  • Common methods for converting a query statement into a vector include: 1. Establishing a vector space model (VSM). Each word in the word feature library is used as one dimension, so the number of dimensions in the VSM equals the number of words in the word feature library. When "the one backstreet boys" is mapped to the VSM, the dimensions corresponding to the four words "the", "one", "backstreet", and "boys" are assigned values. A value can represent the occurrence of the word or a statistical feature of the word.
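Condition three can be sketched as follows. This is a minimal illustration, not the patent's implementation: the five-word `VOCABULARY`, the quote-stripping tokenizer, the use of occurrence counts as dimension values, and the threshold value are all assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical word feature library; each word is one VSM dimension.
VOCABULARY = ["the", "one", "backstreet", "boys", "lyrics"]

def tokenize(query):
    """Split on whitespace and strip surrounding quotes (a simplifying assumption)."""
    return [w.strip("'\"").lower() for w in query.split() if w.strip("'\"")]

def to_vector(query, vocabulary=VOCABULARY):
    """Map a query statement to a vector of word occurrence counts."""
    counts = Counter(tokenize(query))
    return [float(counts[word]) for word in vocabulary]

def angle_between(u, v):
    """Return the angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Clamp to guard against floating-point drift outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / norm)))

first = to_vector("backstreet boys 'the one'")
second = to_vector("the one backstreet boys")
THIRD_THRESHOLD = 0.5  # radians; an arbitrary illustrative value
forms_chain = angle_between(first, second) < THIRD_THRESHOLD
```

Because occurrence counts ignore word order, the two example query statements map to the same vector, so the angle between them is zero.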
  • Condition four differs from condition one in that the ratio of the length of the longest common clause of "backstreet boys 'the one'" and "the one backstreet boys" to the sum of the lengths of the two query statements is compared with the fourth threshold. If the ratio is greater than the fourth threshold, "backstreet boys 'the one'" and "the one backstreet boys" can be judged to belong to the same statement chain.
  • The length of a query statement is the number of words contained in the query statement.
  • Condition five differs from condition one in that the ratio of the length of the longest common clause of "backstreet boys 'the one'" and "the one backstreet boys" to the length of the shorter of the two query statements is compared with the fifth threshold. If the ratio is greater than the fifth threshold, "backstreet boys 'the one'" and "the one backstreet boys" can be judged to belong to the same statement chain.
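The ratios used in conditions four and five can be sketched as below. The sketch assumes that the "longest common clause" is the longest contiguous run of words shared by both query statements and that both statements are non-empty; the function names are illustrative and do not come from the source.

```python
def longest_common_clause_len(a, b):
    """Length, in words, of the longest contiguous word run shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at the previous word pair.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def ratio_to_sum(a, b):
    """Condition four: common clause length over the sum of both lengths."""
    return longest_common_clause_len(a, b) / (len(a) + len(b))

def ratio_to_shorter(a, b):
    """Condition five: common clause length over the shorter statement's length."""
    return longest_common_clause_len(a, b) / min(len(a), len(b))

first = ["backstreet", "boys", "the", "one"]
second = ["the", "one", "backstreet", "boys"]
```

For the two example statements the longest shared run is two words long, which gives a ratio of 0.25 for condition four and 0.5 for condition five.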
  • Condition six differs from condition three in that, after "backstreet boys 'the one'" and "the one backstreet boys" are converted into vectors, the distance between the two vectors is calculated. If the distance is less than the sixth threshold, "backstreet boys 'the one'" and "the one backstreet boys" can form a statement chain.
  • The distance between the two vectors in condition six can be the Euclidean distance.
  • In practice, step 608 can use any two or more of the six parameters in any combination. For example, the six parameters are weighted and summed to obtain a total parameter, and the total parameter is then compared with a threshold to determine whether two query statements can form a statement chain.
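The weighted combination can be sketched as follows. The sketch assumes that every parameter has already been oriented so that a larger value means the two query statements are more similar (for example, an angle or distance converted to a similarity); the parameter names, weights, and threshold are hypothetical.

```python
def combined_score(parameters, weights):
    """Weighted sum of similarity parameters, keyed by parameter name."""
    return sum(weights[name] * value for name, value in parameters.items())

def forms_statement_chain(parameters, weights, threshold):
    """Two query statements form a chain when the total parameter exceeds the threshold."""
    return combined_score(parameters, weights) > threshold

# Illustrative values only; three of the six parameters are shown.
parameters = {
    "clause_ratio_to_sum": 0.25,
    "clause_ratio_to_shorter": 0.5,
    "cosine_similarity": 1.0,
}
weights = {
    "clause_ratio_to_sum": 1.0,
    "clause_ratio_to_shorter": 1.0,
    "cosine_similarity": 2.0,
}
total = combined_score(parameters, weights)
```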
  • Query statements used by a user when searching for different targets generally differ greatly, while the change features between two query statements for the same or a similar retrieval target are more effective for stop word recognition. The six conditions above are therefore essentially used to determine that the difference between the first query statement and the second query statement is small, so as to obtain a second query statement that targets the same or a similar retrieval target as the first query statement.
  • After step 608, if it is determined that two query statements can form a statement chain, the result of step 608 can be stored in the historical query statements.
  • The storage format is, for example, statement chain 1: query statement A, query statement B; statement chain 2: query statement C, query statement D.
  • If query statement D and query statement E are then found to form a statement chain, a statement chain 4 (query statement C, query statement D, query statement E) can also be created, and statement chain 2 can of course be replaced with statement chain 4. The statement chain information stored in the information retrieval system is thereby further enriched, and the change features subsequently extracted between query statements are richer as well.
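Storing and extending statement chains could look like the sketch below. It assumes chains are kept as ordered lists and that a new pair extends an existing chain whose last query statement matches the first element of the pair, which corresponds to replacing the shorter chain with the longer one; `add_pair` is a hypothetical helper name.

```python
def add_pair(chains, a, b):
    """Record that query statements a and b form a chain.

    If an existing chain ends with a, b is appended to it (the longer chain
    replaces the shorter one); otherwise a new two-element chain is created.
    """
    for chain in chains:
        if chain[-1] == a and b not in chain:
            chain.append(b)
            return chains
    chains.append([a, b])
    return chains

chains = [["A", "B"], ["C", "D"]]
add_pair(chains, "D", "E")  # extends the chain C, D to C, D, E
add_pair(chains, "F", "G")  # starts a new chain
```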
  • Step 610: acquire the change features of each word of the first query statement relative to the second query statement.
  • The change features of each word of the first query statement relative to the second query statement indicate the various changes of each word in the first query statement relative to the second query statement.
  • The change features of each word of the first query statement relative to the second query statement include any one of the following: a first change feature, indicating whether each word of the first query statement is a new word relative to the second query statement; a second change feature, indicating, for a word contained in both the first query statement and the second query statement, the change of its position in the first query statement relative to its position in the second query statement; a third change feature, indicating, for a word contained in both query statements, the change of its part of speech in the first query statement relative to its part of speech in the second query statement; a fourth change feature, indicating, for a word contained in both query statements, the change of its grammatical category in the first query statement relative to its grammatical category in the second query statement; and a fifth change feature, indicating, for a word contained in both query statements, the change of the punctuation marks at its two ends in the first query statement relative to the punctuation marks at its two ends in the second query statement.
  • The first change feature indicates whether each word of the first query statement is a new word relative to the second query statement.
  • A word that is newly added in the first query statement relative to the second query statement is less likely to be a stop word.
  • For example, if the second query statement is "backstreet boys" and the first query statement is "backstreet boys the one", then "the one" is new in the first query statement relative to the second query statement.
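The first change feature can be sketched as a simple set lookup. The sketch assumes both query statements are already split into words and that each word appears at most once in the first query statement; `new_word_features` is an illustrative name.

```python
def new_word_features(first_words, second_words):
    """Return {word: 1 if the word is new relative to the second query, else 0}."""
    seen = set(second_words)
    return {word: int(word not in seen) for word in first_words}

features = new_word_features(
    ["backstreet", "boys", "the", "one"],  # first query statement
    ["backstreet", "boys"],                # second query statement
)
```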
  • word_11 in Query A is a word or punctuation mark; m_11 is the position at which the word appears in A; m_12 is the grammatical category of the word in A; m_13 is the part of speech of the word; m_14 to m_1n are other statistical features or statement features of word_11 in the first query statement.
  • By comparing Query A with Query B, it can be determined whether each word in Query A is new relative to Query B; that is, it can be recognized whether each word in the first query statement is a new word relative to the second query statement.
  • The second change feature indicates, for a word contained in both the first query statement and the second query statement, the change of its position in the first query statement relative to its position in the second query statement. Generally, if the position of a word in the first query statement has moved relative to its position in the second query statement, the word is of higher importance, that is, the word is less likely to be a stop word.
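The second change feature can be sketched as a signed position offset. The sketch assumes word-level positions counted from zero and that each shared word appears once per query statement; the source does not specify how the position change is encoded, so the subtraction is an assumption.

```python
def position_change_features(first_words, second_words):
    """For each shared word, its index in the first query minus its index in the second."""
    second_index = {}
    for i, word in enumerate(second_words):
        second_index.setdefault(word, i)  # keep the first occurrence
    return {
        word: i - second_index[word]
        for i, word in enumerate(first_words)
        if word in second_index
    }

moves = position_change_features(
    ["backstreet", "boys", "the", "one"],
    ["the", "one", "backstreet", "boys"],
)
```

A non-zero value means the word has moved between the two query statements.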
  • The third change feature indicates, for a word contained in both the first query statement and the second query statement, the change of its part of speech in the first query statement relative to its part of speech in the second query statement. Words with different parts of speech differ in their likelihood of being stop words; for example, a noun is generally less likely to be a stop word than an adjective. If different parts of speech are assigned different feature values and the part of speech of a word in the first query statement has changed relative to the second query statement, the third change feature can be the feature value of the word's part of speech in the first query statement minus the feature value of its part of speech in the second query statement.
  • The fourth change feature indicates, for a word contained in both the first query statement and the second query statement, the change of its grammatical category in the first query statement relative to its grammatical category in the second query statement. Words of different grammatical categories differ in their likelihood of being stop words. If different grammatical categories, such as subject, predicate, and object, are assigned different feature values and the grammatical category of a word in the first query statement has changed relative to the second query statement, the fourth change feature can be the feature value of the word's grammatical category in the first query statement minus the feature value of its grammatical category in the second query statement.
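The third and fourth change features follow the same pattern, a difference of assigned feature values, and can be sketched with one helper. The feature values in `POS_VALUES` and the tags in the example are hypothetical; the sketch also assumes the tags are produced elsewhere, since no tagger is described here.

```python
# Hypothetical feature values per part of speech; a grammatical-category table
# (subject, predicate, object, ...) would be used in the same way.
POS_VALUES = {"noun": 1.0, "verb": 2.0, "adjective": 3.0, "article": 4.0}

def category_change(first_tags, second_tags, values):
    """For each shared word, value of its tag in the first query minus that in the second."""
    return {
        word: values[first_tags[word]] - values[second_tags[word]]
        for word in first_tags
        if word in second_tags
    }

change = category_change(
    {"one": "noun", "boys": "noun"},       # tags in the first query statement
    {"one": "adjective", "boys": "noun"},  # tags in the second query statement
    POS_VALUES,
)
```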
  • The fifth change feature indicates, for a word contained in both the first query statement and the second query statement, the change of the punctuation marks at its two ends in the first query statement relative to the punctuation marks at its two ends in the second query statement. Different punctuation at the two ends of a word implies a different likelihood that the word is a stop word. For example, if a word has more spaces around it in the first query statement than in the second query statement, or is enclosed in double or single quotation marks, the word is less likely to be a stop word, and the fifth change feature can indicate this.
  • For example, if the second query statement is "backstreet boys the one" and the first query statement is "backstreet boys 'the one'", then "the one" is enclosed in quotation marks at both ends in the first query statement, so "the one" is less likely to be a stop word.
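Detecting words that become quoted can be sketched with a regular expression. The sketch assumes quotation marks always come in matching pairs and that apostrophes inside words do not occur; `quoted_words` and `newly_quoted` are illustrative names.

```python
import re

# A quote character, the shortest possible content, then the same quote character.
QUOTED = re.compile(r"""(['"])(.+?)\1""")

def quoted_words(query):
    """Return the set of words that appear inside matching quotation marks."""
    words = set()
    for match in QUOTED.finditer(query):
        words.update(match.group(2).split())
    return words

def newly_quoted(first_query, second_query):
    """Words quoted in the first query statement but not in the second."""
    return quoted_words(first_query) - quoted_words(second_query)

result = newly_quoted("backstreet boys 'the one'", "backstreet boys the one")
```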
  • step 608 If it is determined in step 608 that "backstreet boys 'the one'” and “the one backstreet boys” can constitute a statement chain, the change characteristics of "backstreet boys 'the one'" relative to "the one backstreet boys” are obtained.
  • Because a statement chain may include more than two query statements, step 610 may acquire not only the change features of "backstreet boys 'the one'" relative to "the one backstreet boys" but also the change features of "backstreet boys 'the one'" relative to other query statements. The acquired change features may be stored in the historical query statements together with the statement chain result of step 608, for use when the same query statement is processed again.
  • Query.isInTheSameQueryChain(), which takes Query M.sessionID and Query N.sessionID as input, is defined to determine, based on the input query statements, whether the two query statements belong to the same statement chain; that is, isInTheSameQueryChain() includes the method used in step 608 to determine that the first query statement and the second query statement belong to the same statement chain.
  • The first change feature above can be implemented with the Word.newWord() function. Once Query1 and Query2 are known to belong to the same statement chain, newWord() can be run for each word in Query1.Word to determine whether each word in the first query statement is a new word relative to the second query statement. The remaining change features are acquired similarly: the various changes of each word in Query1 relative to Query2 can be obtained with the functions defined in Query1.Word.
  • Step 612: identify the stop words in the first query statement according to the change features of each word of the first query statement relative to the second query statement.
  • Each change feature of each word of the first query statement is assigned a value, where a higher value indicates a higher probability that the word is a stop word. If the weighted sum of the values of the change features of a word of the first query statement is greater than a preset threshold, the word is recognized as a stop word; otherwise, the word is recognized as a non-stop word.
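The threshold rule of step 612 can be sketched as follows. The feature names, weights, and threshold are hypothetical; the only property taken from the text is that higher feature values make a word more likely to be a stop word.

```python
def identify_stop_words(word_features, weights, threshold):
    """Return the words whose weighted feature sum exceeds the threshold."""
    stop_words = []
    for word, features in word_features.items():
        score = sum(weights[name] * value for name, value in features.items())
        if score > threshold:
            stop_words.append(word)
    return stop_words

# Illustrative features: 1.0 means the word is not new or not quoted.
word_features = {
    "one": {"not_new": 1.0, "unquoted": 1.0},
    "kobe": {"not_new": 0.0, "unquoted": 1.0},
}
weights = {"not_new": 0.5, "unquoted": 0.25}
stop_words = identify_stop_words(word_features, weights, threshold=0.5)
```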
  • If step 604 is further performed after step 602, that is, the statistical features of each word of the first query statement are obtained, step 612 may identify the stop words in the first query statement based on the statistical features of each word of the first query statement and the change features of each word of the first query statement relative to the second query statement.
  • Alternatively, step 612 may identify the stop words in the first query statement based on the statistical features of each word of the first query statement, the statement features of each word of the first query statement in the first query statement, and the change features of each word of the first query statement relative to the second query statement, further improving the accuracy of stop word recognition.
  • The information retrieval system is further provided with a recognition model. The statistical features of each word of the first query statement and the change features of each word of the first query statement relative to the second query statement are input to the recognition model, which identifies whether each word in the first query statement is a stop word.
  • The recognition model may be a threshold model; for example, if the weighted sum of the values of a word's change features and statistical features is greater than a preset threshold, the word is recognized as a stop word, and otherwise it is recognized as a non-stop word. The recognition model can also be a learning model such as a decision tree or a neural network.
  • If step 604 is further performed after step 602 and the first query statement cannot form a statement chain with any historical query statement, so that no change feature of the first query statement relative to a second query statement can be obtained, step 612 identifies the stop words in the first query statement only according to the statistical features of each word of the first query statement.
  • The method provided by this embodiment adds the change features between query statements to the stop word recognition process, so that the information retrieval system can better perform stop word recognition according to changes in the query statements input by the user. This avoids the errors of the traditional approach of identifying stop words with a stop word list, which cannot take into account the changes the user makes when adjusting the query statement.
  • The method further includes step 614: use the statistical features of the stop words in the first query statement and the change features of those stop words relative to the second query statement as positive samples, use the statistical features of the words in the first query statement other than the stop words and the change features of those words relative to the second query statement as negative samples, and train the recognition model according to the positive samples and the negative samples.
  • After such training, when the recognition model subsequently performs stop word recognition, a received word that is classified as a positive sample is recognized as a stop word, and a received word that is classified as a negative sample is not recognized as a stop word. Training the recognition model in this way improves its accuracy.
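Assembling the samples for step 614 can be sketched as below. The feature dictionaries are hypothetical placeholders for the statistical and change features of each word, and `build_samples` is an illustrative name; how the recognition model consumes the samples is left open.

```python
def build_samples(word_features, stop_words):
    """Split per-word feature records into positive and negative training samples."""
    positives = [f for word, f in word_features.items() if word in stop_words]
    negatives = [f for word, f in word_features.items() if word not in stop_words]
    return positives, negatives

# Illustrative per-word features for the query "one basketball player Kobe".
word_features = {
    "one": {"frequency": 0.9, "is_new": 0},
    "basketball": {"frequency": 0.2, "is_new": 0},
    "kobe": {"frequency": 0.1, "is_new": 1},
}
positives, negatives = build_samples(word_features, stop_words={"one"})
```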
  • If step 612 identifies the stop words in the first query statement according to the statistical features of each word of the first query statement, the statement features of each word in the first query statement, and the change features of each word relative to the second query statement, then in step 614 the statement features of the stop words in the first query statement are also used as positive samples, and the statement features of the words other than the stop words in the first query statement are also used as negative samples, to train the recognition model and further improve its accuracy.
  • The information retrieval system may store the obtained positive samples and negative samples and perform step 614 after a certain time has elapsed or a certain number of positive samples and negative samples have accumulated.
  • Before the training, the stop words identified by the recognition model are removed from the first query statement to obtain candidate search terms, and a search is performed according to the candidate search terms to obtain a search result; the training is performed when the correctness of the search result is determined.
  • The retrieval-process features include the statistical features of each word of the first query statement and the change features of each word of the first query statement relative to the second query statement.
  • The files corresponding to the file IDs are returned to the user, and the operation information of the user's operations on these files is also recorded in the historical query log. The operation information describes how the user handled each file obtained by the query, such as which files the user clicked, the time at which each click occurred, and the browsing time in each file, where the interval between the time one file is clicked and the time the next file is clicked is generally taken as the browsing time of the former file.
  • The correctness of the search result corresponding to the first query statement is determined by analyzing the operation information corresponding to the first query statement to determine how satisfied the user is with that search result. For example, if the user clicks one of the files in the search result of the first query statement and does not click any other file within 60 seconds, the user can be considered to have found the required file in this search result. The words identified as stop words in the first query statement, with their retrieval-process features, can then be used as positive samples of the recognition model, and the words not identified as stop words in the first query statement, with their retrieval-process features, can be used as negative samples, to train the recognition model.
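The 60-second example can be sketched as a check over click timestamps. The sketch assumes timestamps in seconds and a known time at which the log for this search ends; `user_satisfied` and its parameters are hypothetical.

```python
def user_satisfied(click_times, log_end, min_dwell=60.0):
    """True if some click is followed by at least min_dwell seconds without another click."""
    # Append the end of the log so the dwell time after the last click is measured too.
    times = sorted(click_times) + [log_end]
    return any(later - earlier >= min_dwell for earlier, later in zip(times, times[1:]))

satisfied = user_satisfied([0.0, 10.0, 25.0], log_end=100.0)    # 75 s after the last click
unsatisfied = user_satisfied([0.0, 10.0, 25.0], log_end=40.0)   # no gap reaches 60 s
```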
  • The screening condition for determining the correctness of the search result corresponding to the first query statement can be set in various ways. Besides the condition that the time during which the user takes no further action after clicking a file exceeds a threshold, the condition can be, for example, that the number of files the user clicks in the current search result exceeds a threshold. Any condition or event that indicates the user approves of the accuracy of the files in the search result can be used.
  • Training the recognition model generally requires a certain number of positive samples and negative samples to accumulate first. Therefore, after a certain number of query statements have accumulated in the historical query log, or after a preset time, the operation information of the search result corresponding to each query statement in the historical query log is analyzed, the retrieval-process features of the query statements suitable as training data are mined, and the recognition model is trained.
  • The user's operations on each file in the search result reflect the user's judgment of the correctness of the search result, and also reflect the accuracy of the stop word recognition result corresponding to that search result.
  • By analyzing the historical query log corresponding to each query statement, it can be learned which query statements produced search results that satisfied the user; the parameters and results of the stop word recognition corresponding to these query statements can be used to train the recognition model.
  • word recognition has a greater effect.
  • For example, the term "merchandise" is used frequently and may carry no special meaning, so when the information retrieval system identifies "merchandise" as a stop word, the retrieval-process features of the word "merchandise" can be used as a positive sample to train the recognition model.
  • Step 616: update the word feature library of the information retrieval system according to new files.
  • Step 616 and steps 602 through 614 may be performed independently of each other; that is, the update of the word feature library and the training of the recognition model may be performed in parallel.
  • Steps 614 and 616 can be performed online (while users input query statements) or offline (for example, when the system is idle or undergoing centralized maintenance and update). Because steps 614 and 616 require the historical query log or file updates to accumulate over a certain period, they are preferably performed offline to avoid adding processing pressure to the information retrieval system while it is online.
  • The stop word recognition method of this embodiment obtains a query statement that belongs to the same session as the to-be-processed query statement, then acquires the change features of the words of the to-be-processed query statement relative to that query statement, and takes these change features into account when identifying the stop words in the to-be-processed query statement. Stop word recognition can thus be performed according to the change features between query statements, which improves the recognition accuracy of stop words and hence the accuracy of the search results output by the information retrieval system.
  • An embodiment of the present invention further provides a stop word recognition device 800, which can be implemented by the retrieval device 202 shown in FIG. 1 or FIG. 2, by the computing device 400 shown in FIG. 3, or by an application-specific integrated circuit (ASIC) or a programmable logic device (PLD).
  • The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
  • The stop word recognition device 800 is used to implement the stop word recognition method shown in FIG. 4.
  • As shown in the schematic diagram of its organization structure, the stop word recognition device 800 includes an input module 802 and a processing module 804.
  • steps 604 to 616 in the stop word recognition method shown in FIG. 4 are executed.
  • The input module 802 is configured to receive the first query statement and obtain the session ID corresponding to the first query statement, that is, to perform step 602 in the stop word recognition method shown in FIG. 4.
  • The processing module 804 is configured to obtain, according to the session ID corresponding to the first query statement, a second query statement that belongs to the same session as the first query statement; to obtain the change features of each word of the first query statement relative to the second query statement; and to identify the stop words in the first query statement according to the change features of each word of the first query statement relative to the second query statement.
  • The second query statement obtained by the processing module 804 can form a statement chain with the first query statement. The conditions for judging that a statement chain is formed include: the length of the longest common clause of the first query statement and the second query statement is greater than a first threshold; or the minimum number of operations required to convert the first query statement into the second query statement is less than a second threshold; or, with the first query statement mapped to a first vector and the second query statement mapped to a second vector, the angle between the first vector and the second vector is less than a third threshold; or the ratio of the length of the longest common clause of the first query statement and the second query statement to the sum of the lengths of the first query statement and the second query statement is greater than a fourth threshold; or the ratio of the length of the longest common clause of the first query statement and the second query statement to the length of the shorter of the two query statements is greater than a fifth threshold; or the distance between the first vector and the second vector is less than a sixth threshold.
  • The processing module 804 further queries the word feature library of the information retrieval system according to each word of the first query statement to obtain the statistical features of each word of the first query statement, and inputs the change features of each word of the first query statement relative to the second query statement and the statistical features of each word of the first query statement to the recognition model to obtain the stop words in the first query statement identified by the recognition model.
  • The recognition model is typically a piece of code that is executed by the processing module 804 when the recognition model is trained or when the recognition model is used to identify stop words.
  • The processing module 804 further removes from the first query statement the stop words identified by the recognition model to obtain candidate search terms, and performs a search according to the candidate search terms to obtain a search result. When the correctness of the search result is determined, the processing module 804 uses the statistical features of the stop words in the first query statement and the change features of those stop words relative to the second query statement as positive samples, uses the statistical features of the words in the first query statement other than the stop words and the change features of those words relative to the second query statement as negative samples, and trains the recognition model according to the positive samples and the negative samples.
  • Determining the correctness of the search result corresponding to the first query statement means analyzing the operation information corresponding to the first query statement to determine how satisfied the user is with that search result. For search results that satisfy the user, the features of the identified stop words and non-stop words are used to train the recognition model, further improving the accuracy of the recognition model.
  • The stop word recognition device of this embodiment obtains a query statement that belongs to the same session as the to-be-processed query statement, then acquires the change features of the words of the to-be-processed query statement relative to that query statement, and takes these change features into account when identifying the stop words in the to-be-processed query statement. Stop word recognition can thus be performed according to the change features between query statements, which improves the recognition accuracy of stop words and hence the accuracy of the search results output by the information retrieval system.
  • the methods described in connection with the present disclosure may be implemented by a processor executing software instructions.
  • The software instructions can consist of corresponding software modules, which can be stored in RAM, flash memory, ROM, erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), a hard disk, an optical disk, or any other form of storage medium known in the art.
  • The functionality described herein can be implemented in hardware or software.
  • When implemented in software, the functions may be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium.
  • A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.

Abstract

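A stop word recognition method, relating to the field of computer technology. In the method, after a first query statement input by a user is obtained, a second query statement that belongs to the same session as the first query statement is obtained, and the stop words in the first query statement are identified according to the change features of each word in the first query statement relative to the second query statement. The method identifies stop words in a query statement more precisely and improves the recognition accuracy of stop words.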

Description

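Stop Word Recognition Method and Apparatus
Technical Field
The present invention relates to the field of computer technology, and in particular to a stop word recognition method and apparatus applied to an information retrieval system, and to a computing device.
Background
An information retrieval system, such as a search engine or a question answering system, retrieves the relevant content a user needs according to the query statement the user inputs. The query statement may contain words that carry no real meaning and occur frequently, known as stop words. To improve retrieval efficiency and accuracy, the information retrieval system needs to identify the stop words in the query statement and remove them to obtain the keywords of the query statement; the system then performs matching according to the obtained keywords to obtain the relevant content the user needs.
As information retrieval systems become widely used and more intelligent, more and more users input query statements in natural or semi-natural language, so the demands on the stop word recognition capability of information retrieval systems keep rising. In the prior art, stop word recognition generally relies on a stop word list manually compiled in advance by vocabulary experts. Such a list is costly to produce, and identifying the stop words in an input statement by matching against the list cannot adapt to increasingly complex user search behavior.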
发明内容
本申请提供了一种停用词识别方法、装置以及计算机设备,以提升停用词的识别精度。
本申请的第一方面提供了一种停用词识别方法,该方法由运行在计算机设备上的信息检索系统执行,包括:接收第一查询语句,并获取第一查询语句对应的会话标识(英文:identify,缩写:ID);根据获取的会话ID,获取与第一 查询语句属于同一会话的第二查询语句;获取第一查询语句的各个词相对于第二查询语句的变化特征,变化特征用于体现第一查询语句的各个词在相对于第二查询语句的各种变化,例如新增词、词性、词所在的位置、词两端的标点符号等,根据第一查询语句的各个词相对于第二查询语句的变化特征,识别第一查询语句中的停用词。
可选的,第二查询语句为用户在输入第一查询语句之前输入的上一查询语句,由于用户通过信息检索系统进行检索的过程中,连续输入的查询语句之间的变化特征更能够体现用户对检索语句的调整,因此连续输入的查询语句之间的变化特征有助于停用词的识别。
通过获取与待处理查询语句属于相同会话的查询语句,随后获取该查询语句与待处理查询语句的词的变化特征,并将该变化特征纳入识别待处理查询语句中的停词的考虑因素中,使得停词识别的过程中,能够根据查询语句之间的变化特征进行停词识别,提升了停用词的识别精度。
结合第一方面,在第一方面的第一种实现方式中,获取的第二查询语句符合以下条件之任一或以下条件中任意两个或多个之间的组合:第一查询语句与第二查询语句的最长公共子句的长度大于第一阈值;或者第一查询语句转换为第二查询语句所需的最少操作数小于第二阈值;或者将第一查询语句映射为第一向量,并将第一查询语句映射为第二向量,第一向量与第二向量之间的夹角小于第三阈值;或者第一查询语句与第二查询语句的最长公共子句的长度,与第一查询语句和第二查询语句的长度之和的比值大于第四阈值;或者第一查询语句与第二查询语句的最长公共子句的长度,与第一查询语句和第二查询语句中长度较短者的长度的比值大于第五阈值;或者第一向量与第二向量之间的距离小于第六阈值。
与第一查询语句属于相同会话的查询语句可以有多个,而由于用户使用信息检索系统的过程中可能会更换检索目标,而用户在检索不同目标时使用的查询语句的变化一般较大,而针对相同或相似检索目标的两个查询语句之间的变化特征对于停用词的识别效果更加优良。因此在与第一查询语句属于相同会话的多个查询语句中进一步进行甄别,确定与第一查询语句相比区别较小的第二查询语句,这类查询语句与第一查询语句有着针对相同或相似检索目标的概率 较大,再将该第二查询语句用于提取第一查询语句中的各个词相对于第二查询语句的变化特征。
结合第一方面或第一方面的第一种实现方式,在第一方面的第二种实现方式中,还包括根据第一查询语句的各个词查询信息检索系统的词特征库,获取第一查询语句的各个词的统计特征;因此停用词的识别过程中,不仅仅根据第一查询语句的各个词相对于第二查询语句的变化特征,还根据第一查询语句的各个词的统计特征识别第一查询语句中的停用词。
第一查询语句的各个词的统计特征也能体现各个词在文件库的统计参数,将统计特征加入到停用词的识别中能进一步提升停用词的识别精度。
可选的,还获取了第一查询语句中各个词在第一查询语句中的语句特征,并根据第一查询语句的各个词的统计特征、第一查询语句的各个词相对于第二查询语句的变化特征和第一查询语句中各个词在第一查询语句中的语句特征来识别第一查询语句中的停用词,以进一步提升停用词的识别精度。
结合第一方面的第二种实现方式,在第一方面的第三种实现方式中,根据第一查询语句的各个词的统计特征和第一查询语句的各个词相对于第二查询语句的变化特征,识别第一查询语句中的停用词包括:将第一查询语句的各个词相对于第二查询语句的变化特征和第一查询语句的各个词的统计特征输入识别模型,获得识别模型识别出的第一查询语句中的停用词,识别模型一般为一段程序代码,该程序代码运行时实现停用词识别的功能。
结合第一方面的第三种实现方式,在第一方面的第四种实现方式中,该方法还包括,将第一查询语句中的停用词的统计特征和第一查询语句中的停用词相对于第二查询语句的变化特征作为正样本,将第一查询语句中除停用词外的其他词的统计特征和第一查询语句中除停用词外的其他词相对于第二查询语句的变化特征作为负样本,根据正样本和负样本对识别模型进行训练。
结合第一方面的第四种实现方式,在第一方面的第五中实现方式中,在进行识别模型的训练之前,将第一查询语句去除所述识别模型识别出的停用词获得候选检索词,根据候选检索词进行检索获得检索结果;在确定检索结果的正确性的情况下,执行训练。
确定检索结果的正确性,即通过对第一查询语句对应的操作信息进行分析以确定用户对第一查询语句对应的检索结果的满意程度,选取用户满意的查询语句对应的停用词的识别过程中,被识别的停用词和非停用词,并将这些停用词和非停用词的各种特征用于识别模型的训练,进一步提升识别模型的识别精度。
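A second aspect of this application provides a stop word identification apparatus that includes an input module and a processing module. The input module is configured to receive a first query statement and obtain the session ID corresponding to the first query statement. The processing module is configured to: obtain, according to the session ID, a second query statement that belongs to the same session as the first query statement; obtain change features of each word of the first query statement relative to the second query statement, where the change features reflect how each word of the first query statement has changed relative to the second query statement, for example its part of speech, its position, and the punctuation marks at its two ends; and identify the stop words in the first query statement according to these change features. The apparatus is configured to implement the stop word identification method provided in the first aspect.
A third aspect of this application provides a computing device that includes a processor and a memory. When running, the computing device can implement the stop word identification method provided in the first aspect; the program code that implements the method may be stored in the memory and executed by the processor.
A fourth aspect of this application provides a storage medium. When the program code stored in the storage medium is executed, the stop word identification method provided in the first aspect is implemented. The program code consists of computer instructions that implement that method.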
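Brief Description of Drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings needed for the embodiments are briefly introduced below. The drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of the organizational structure of an information retrieval system according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the organizational structure of another information retrieval system according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the organizational structure of a computing device according to an embodiment of the present invention;
FIG. 4 is a schematic flowchart of a stop word identification method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the organizational structure of a stop word identification apparatus according to an embodiment of the present invention.
Description of Embodiments
The technical solutions in the embodiments of the present invention are described below with reference to the accompanying drawings.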
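Throughout this specification, the term "stop word" refers to a word that has no direct effect, or only a slight effect, on what a statement expresses, for example a word in a user's query statement that does not help retrieve relevant files. In the query statement "one basketball player Kobe", the word "one" does not help retrieve what the user wants, so "one" can be regarded as a stop word in this scenario. Whether a given word is a stop word can depend on the context and the application scenario: in the query statement "one world one dream", removing "one" as a stop word would seriously degrade the accuracy of the search results.
Throughout this specification, the term "session" covers the messages exchanged between two or more devices within a time interval. If a session is established between a user and a server, the interval starts when the user begins using a service and ends either when the user explicitly stops using the service or when the user has not interacted with the server for a certain period, for example 30 minutes. In an information retrieval system, when a new session starts, the system generates a new session ID and keeps receiving query statements from the user. When the system has received no new query statement from the user for a certain continuous period, it considers the current session ended. All query statements received between the start and the end of the session belong to that session, and the session ID is stored together with those query statements in the historical query statements.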
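The session rule described above can be sketched as follows. This is a minimal illustration only: the class name `SessionTracker`, its fields, and the use of plain numeric timestamps in seconds are assumptions made for this sketch, and the 30-minute value is simply the example figure from the text.

```python
import itertools

SESSION_TIMEOUT_SECONDS = 30 * 60  # example idle limit taken from the text


class SessionTracker:
    """Hypothetical helper that assigns session IDs based on idle time."""

    def __init__(self, timeout=SESSION_TIMEOUT_SECONDS):
        self.timeout = timeout
        self._next_id = itertools.count(1)
        self._last_seen = {}  # user -> (session_id, time of last query)
        self.history = {}     # session_id -> query statements of that session

    def session_id_for(self, user, query, now):
        """Return the session ID for `query`; open a new session after a timeout."""
        previous = self._last_seen.get(user)
        if previous is None or now - previous[1] > self.timeout:
            session_id = next(self._next_id)  # new session, new ID
        else:
            session_id = previous[0]          # still the existing session
        self._last_seen[user] = (session_id, now)
        self.history.setdefault(session_id, []).append(query)
        return session_id
```

Storing the queries under their session ID mirrors the "historical query statements" store, so a later lookup by session ID returns every earlier query of the same session.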
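Throughout this specification, the term "boundary-less language" refers to a language in which no punctuation marks or spaces delimit words. Common boundary-less languages include Chinese and Japanese. The most common language with boundaries is English.
Architecture of the information retrieval system to which the embodiments of the present invention apply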
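FIG. 1 shows one implementation of the information retrieval system 200, which includes a storage device 206 and a retrieval device 202. The storage device 206 stores the data that the information retrieval system needs for retrieval. The storage device 206 may communicate with the retrieval device 202 through a communication network 204, or it may be placed directly inside the retrieval device 202 and communicate with it through the input/output unit 2021. The retrieval device 202 includes the input/output unit 2021 and a processing unit 2022. After a user sends a query statement to the retrieval device 202 through the input/output unit 2021, the retrieval device 202 performs retrieval according to the query statement and returns the corresponding retrieval result to the user; an information retrieval system generally presents the retrieval result as a series of files. If the user sends the query statement to the retrieval device 202 through the communication network 204, the input/output unit 2021 may be a network interface. If the user sends the query statement locally at the retrieval device 202, the input/output unit 2021 may instead be an input/output (I/O) interface of the retrieval device 202.
FIG. 2 shows another implementation of the information retrieval system 200, which includes one or more retrieval devices 202 and one or more storage devices 206 that communicate with one another through a communication network. The file library, index file library, historical query statements, historical query log, word feature library, and other data of the information retrieval system 200 may be deployed in a distributed manner across the storage devices 206. The retrieval devices 202 may form a distributed computing system to process query statements. When many query statements are waiting to be processed and the load on the information retrieval system 200 is high, the pending tasks can be distributed to different retrieval devices 202, which improves the parallel processing capability of the information retrieval system 200.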
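The information retrieval system 200 generally updates, at regular intervals, the files it can index and stores them in the file library. After obtaining the updated files, the information retrieval system 200 assigns an ID to each file and builds an index. A common index is the inverted index, which, as shown in Table 1, records the IDs of the files in which each word appears. The file that records the index is also called an index file.
Word 1: File 1, File 6
Word 2: File 3, File 4
Word n: File 5, File 9
Table 1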
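The inverted index of Table 1 can be illustrated with a short sketch. The function names `build_inverted_index` and `search`, the whitespace tokenization, and the rule that a file must contain every search word are assumptions chosen for illustration; the text does not specify how matching or scoring is done.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to the sorted IDs of the files that contain it."""
    index = defaultdict(set)
    for file_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(file_id)
    return {word: sorted(ids) for word, ids in index.items()}


def search(index, words):
    """Return the IDs of files that contain every given word."""
    postings = [set(index.get(word, ())) for word in words]
    if not postings:
        return []
    return sorted(set.intersection(*postings))
```

Because a stop word such as "the" tends to appear in almost every file, leaving it among the search words adds little discrimination, which is why the words are filtered before they are matched against the index.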
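After the retrieval device 202 obtains a query statement through the input/output unit 2021, the processing unit 2022 splits the query statement into a series of words. If the query statement is in a boundary-less language, this process is also called word segmentation: for example, "手机购物" is segmented into the two words "手机" (meaning mobile phone, pinyin shǒujī, tone contours 214 and 55) and "购物" (meaning shopping, pinyin gòuwù, both syllables with tone contour 51). If the query statement is in English, no segmentation is needed, and the words are separated directly by the spaces in the query statement. Some of the resulting words may be stop words, so to keep the retrieval result accurate, the stop words among them are identified next. The words that remain after the stop words are removed are then matched against the index file to obtain the matching status of each matched file, including its score or rank. Finally, a certain number of files with the highest scores or ranks are returned to the user.
As this workflow shows, the accuracy of the retrieval result output by the information retrieval system 200 depends largely on the accuracy of the words matched against the index file, so accurate identification of stop words is important to the performance of the information retrieval system.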
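The retrieval device 202 in FIG. 1 or FIG. 2 may be implemented by the computing device 400 in FIG. 3. As shown in FIG. 3, the computing device 400 includes a processor 402 and a memory 404, and may further include a bus 408 and a communication interface 406. The communication interface 406 may be one implementation of the input/output unit 2021, and the processor 402 and the memory 404 may be one implementation of the processing unit 2022.
The processor 402, the memory 404, and the communication interface 406 may be connected to one another through the bus 408, or may communicate by other means such as wireless transmission.
The memory 404 may include a volatile memory, for example a random-access memory (RAM); it may also include a non-volatile memory, for example a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); it may also include a combination of these kinds of memory. When the computing device 400 runs, the memory 404 loads the historical query statements, the historical query log, the word feature library, and other data from the storage device 206 for use by the processor 402. When the technical solution of the present invention is implemented in software, the program code that implements the stop word identification method of FIG. 4 may be stored in the memory 404 and executed by the processor 402.
The computing device 400 obtains a query statement through the communication interface 406 and, after obtaining the corresponding retrieval result, returns it to the user through the communication interface 406.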
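The processor 402 may be a central processing unit (CPU). After obtaining the first query statement, the processor 402 obtains a second query statement that belongs to the same session as the first query statement and extracts the change features of each word of the first query statement relative to the second query statement. The change features indicate changes in the position, part of speech, punctuation marks at the two ends, grammatical category, and so on of each word of the first query statement relative to the second query statement. The processor 402 uses these change features to identify the stop words in the first query statement.
A query statement that belongs to the same session as the query statement to be processed is obtained, the change features of the words between the two query statements are extracted, and these change features are taken into account when identifying stop words in the query statement to be processed. Stop word identification can therefore draw on the changes between query statements, which improves identification accuracy.
The processor 402 may obtain several query statements that belong to the same session as the first query statement. A user may switch search targets while using the information retrieval system, and query statements aimed at different targets generally differ considerably, whereas the change features between two query statements aimed at the same or a similar target are more useful for identifying stop words. The processor 402 therefore screens the query statements of the session further to find a second query statement that differs only slightly from the first query statement, and then uses that second query statement to extract the change features of each word of the first query statement.
The processor 402 may also obtain the statistical features of each word of the first query statement and input the statistical features and change features of each word into the identification model to identify the stop words in the first query statement. The identification model used by the processor 402 may be a piece of program code stored in the memory 404, which the processor 402 calls when it trains the identification model or uses it to identify stop words. The identification model may also be implemented in hardware, in which case the processor 402 inputs the statistical features and change features of each word of the first query statement into the hardware, and the hardware returns the identification result to the processor 402. The hardware may be a field-programmable gate array (FPGA).
The statistical features of each word of the first query statement reflect the statistics of that word in the file library of the information retrieval system, and adding them to the identification process also helps identify the stop words in the first query statement.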
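The present invention further provides a stop word identification method, which is performed by the retrieval device 202 in FIG. 1 or FIG. 2 or the computing device 400 in FIG. 3 when it runs. FIG. 4 is a schematic flowchart of the method.
Step 602: Receive a first query statement and obtain the session ID corresponding to the first query statement.
In this embodiment, the first query statement received by the information retrieval system is "backstreet boys ‘the one’", and the session ID corresponding to it is obtained. There are generally two cases. If the query statement is the first query statement of a new session, a session ID must be generated for it in step 602. If the query statement belongs to an existing session, the session ID obtained in step 602 is the ID of that existing session.
Optionally, step 604 is performed after step 602.
Step 604: Query the word feature library of the information retrieval system according to each word of the first query statement to obtain the statistical features of each word of the first query statement.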
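First, the four words "backstreet", "boys", "the", and "one" are obtained from "backstreet boys ‘the one’". If the query statement received in step 602 is in a boundary-less language, it must first be segmented to obtain its words. The statistical features of these four words are then obtained, for example word frequency, mean word frequency, and word frequency variance. The word feature library is built by the information retrieval system by collecting statistics on the features of each word that appears in a certain number of files, or in the files obtained within a certain period, so the statistical feature values of each word of a query statement can be looked up in it. Information retrieval systems often store a query statement as an array, for example Query1[6][n+2]={{backstreet,1,statistical feature 1,statistical feature 2…statistical feature n},{boys,2,statistical feature 1,statistical feature 2…statistical feature n},{‘,3},{the,4,statistical feature 1,statistical feature 2…statistical feature n},{one,5,statistical feature 1,statistical feature 2…statistical feature n},{’,6}}, where {backstreet,1,statistical feature 1,statistical feature 2…statistical feature n} indicates that the first word of the query statement is "backstreet" and that statistical feature 1 to statistical feature n are the statistical features of "backstreet". The statistical features of each word of the first query statement reflect the statistics of that word in the file library of the information retrieval system, and analyzing them also helps identify stop words.
After receiving the first query statement, the information retrieval system processes it and stores it as a data structure such as Query1[6][n+2] above. Besides the statistical features of each word, the system optionally also obtains the statement features of each word within the first query statement, including how many times the word appears in the first query statement, its part of speech, its grammatical category, its position in the first query statement, whether it is preceded and followed by spaces, and whether it is enclosed in quotation marks. If there are m kinds of statement features, "backstreet boys ‘the one’" is converted into Query1[6][n+m+2]. The statement features of each word reflect the characteristics of that word within the first query statement, and analyzing them also helps identify stop words.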
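As an illustration of the Query1 array described above, the following sketch turns a query string into rows of the form [token, position, features…]. The function names, the regular expression used to split off punctuation, and the contents of `feature_library` are assumptions for this sketch; the text does not say how the system tokenizes or what the feature values are.

```python
import re


def tokenize(query):
    """Split a space-delimited query into words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", query)


def to_rows(query, feature_library=None):
    """Return rows [token, position, *statistical features], positions 1-based.

    `feature_library` is a hypothetical word feature library: a dict that
    maps a word to its list of statistical feature values.
    """
    library = feature_library or {}
    rows = []
    for position, token in enumerate(tokenize(query), start=1):
        rows.append([token, position, *library.get(token, [])])
    return rows
```

For "backstreet boys ‘the one’" this yields six rows, with the two quotation marks at positions 3 and 6 carrying no statistical features, which matches the shape of Query1 in the text.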
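Step 606: Obtain, according to the session ID corresponding to the first query statement, a second query statement that belongs to the same session as the first query statement.
The historical query statements are queried according to the session ID corresponding to the first query statement to obtain a second query statement that belongs to the same session as "backstreet boys ‘the one’". In this example the second query statement is "the one backstreet boys", that is, Query2[4][2]={{the,1},{one,2},{backstreet,3},{boys,4}}; the statistical features and statement features of its words are omitted from Query2. Because one session may contain several query statements, the query statement obtained in step 606 may optionally be the one immediately preceding the query statement of step 602. If a user does not obtain the required files with one query, the user adjusts the query statement and searches again, so adjacent query statements are more likely to form a statement chain, and the change features between adjacent query statements are more helpful for identifying stop words. The historical query statements record each session ID together with the query statements of that session, and may also record which query statements belong to the same statement chain.
Steps 604 and 606 may be performed in either order or in parallel. The statistical features and statement features obtained in step 604 and the change features obtained through steps 606, 608, and 610 can all be used in step 612, so after step 602 is performed, step 604 and step 606 may be processed in parallel; the branch that starts with step 606 continues with steps 608 and 610.
Optionally, step 608 is performed after step 606.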
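Step 608: Determine that the obtained second query statement meets any one of the following conditions: the length of the longest common clause of the first query statement and the second query statement is greater than a first threshold; or the minimum number of operations required to convert the first query statement into the second query statement is less than a second threshold; or, with the first query statement mapped to a first vector and the second query statement mapped to a second vector, the angle between the first vector and the second vector is less than a third threshold; or the ratio of the length of the longest common clause of the two query statements to the sum of the lengths of the two query statements is greater than a fourth threshold; or the ratio of the length of the longest common clause of the two query statements to the length of the shorter of the two query statements is greater than a fifth threshold; or the distance between the first vector and the second vector is less than a sixth threshold.
The session to which the first query statement belongs may include several query statements, so step 608 screens them further to find a second query statement that meets the conditions, that is, one that can form a statement chain with the first query statement. The screening may be based on any one of the conditions below or on any combination of several of them. Determining that the second query statement and the first query statement meet the condition or combination of conditions amounts to determining that they can form a statement chain. If they can form a statement chain, step 610 is performed next; if they cannot, step 612 is performed.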
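Condition one: compare the length of the longest common clause of "backstreet boys ‘the one’" and "the one backstreet boys" with the first threshold. In this embodiment, the length of the longest common clause of the two query statements is 2. If the first threshold is 1, "backstreet boys ‘the one’" and "the one backstreet boys" can form a statement chain. The length of a clause is the number of words it contains.
Comparing the elements of Query1[6][2]={{backstreet,1},{boys,2},{‘,3},{the,4},{one,5},{’,6}} with those of Query2[4][2]={{the,1},{one,2},{backstreet,3},{boys,4}} one by one shows that the longest common clauses of Query1 and Query2 are "backstreet boys" and "the one". Both clauses have length 2, so the length of the longest common clause of Query1 and Query2 is 2.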
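Condition two: determine the minimum number of operations needed to convert "backstreet boys ‘the one’" into "the one backstreet boys", or "the one backstreet boys" into "backstreet boys ‘the one’". If this minimum number of operations is less than the second threshold, the two query statements can be judged to belong to the same statement chain. In this example, converting "the one backstreet boys" into "backstreet boys ‘the one’" requires at least deleting "the" and "one" from the beginning, appending "the" and "one" at the end, and adding a quotation mark at each end of "the one", which is 6 operations in total.
Comparing Query1 with Query2 shows that, because their longest common clauses are "backstreet boys" and "the one", converting "the one backstreet boys" into "backstreet boys ‘the one’" takes at least 6 operations: delete {the,1} and {one,2} from Query2, so that {backstreet,3} and {boys,4} become {backstreet,1} and {boys,2}; append {the,4} and {one,5} after {boys,2}; and add {‘,3} and {’,6} at the two ends of the newly added {the,4} and {one,5}.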
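The count of 6 operations can be reproduced with a small dynamic-programming sketch. One assumption is made explicit here: the text does not define the allowed operation set, and its worked example counts only insertions and deletions of tokens, so this sketch allows only those two operations (a distance that also allowed substitutions could give a smaller number). The function name is hypothetical.

```python
def min_insert_delete_ops(source, target):
    """Minimum number of token insertions and deletions turning source into target."""
    m, n = len(source), len(target)
    # dist[i][j]: operations needed to turn source[:i] into target[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i  # delete every remaining source token
    for j in range(n + 1):
        dist[0][j] = j  # insert every remaining target token
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if source[i - 1] == target[j - 1]:
                dist[i][j] = dist[i - 1][j - 1]  # tokens match, no cost
            else:
                dist[i][j] = 1 + min(dist[i - 1][j], dist[i][j - 1])
    return dist[m][n]
```

With Query2 as the source and Query1 (including its two quotation-mark tokens) as the target, the function returns 6, in line with the example; the result would then be compared with the second threshold.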
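Condition three: compute the angle between the vectors obtained from "backstreet boys ‘the one’" and "the one backstreet boys". If the angle is less than the third threshold, the two query statements can form a statement chain. Common ways of converting a query statement into a vector include: 1. Build a vector space model (VSM) in which each word of the word feature library is one dimension, so that the number of dimensions of the VSM equals the number of words in the word feature library. When "the one backstreet boys" is mapped into the VSM, the dimensions corresponding to "the", "one", "backstreet", and "boys" are assigned values, where a value may indicate that the word occurs or may be a statistical feature of the word. Once the VSM is built, "backstreet boys ‘the one’" and "the one backstreet boys" become two vectors in the VSM space, and the angle or distance between them can be computed. 2. Use Word2vec, bag of words, word embedding, or another method that converts a statement into a vector.
Condition four differs from condition one in that it compares, with the fourth threshold, the ratio of the length of the longest common clause of "backstreet boys ‘the one’" and "the one backstreet boys" to the sum of the lengths of the two query statements. If the ratio is greater than the fourth threshold, the two query statements can be judged to belong to the same statement chain. The length of a query statement is the number of words it contains.
Condition five differs from condition one in that it compares, with the fifth threshold, the ratio of the length of the longest common clause of "backstreet boys ‘the one’" and "the one backstreet boys" to the length of the shorter of the two query statements. If the ratio is greater than the fifth threshold, the two query statements can be judged to belong to the same statement chain.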
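Conditions one, four, and five all depend on the longest common clause, so they can be sketched together. The sketch treats a clause as a contiguous run of words and, as an assumption, measures lengths on word-only token lists (quotation marks removed); the function names are hypothetical.

```python
def longest_common_clause_length(a, b):
    """Length in words of the longest contiguous run shared by token lists a and b."""
    best = 0
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # extend the common run that ended at the previous pair of words
                current[j] = previous[j - 1] + 1
                best = max(best, current[j])
        previous = current
    return best


def clause_parameters(a, b):
    """Return the parameters used by conditions one, four, and five."""
    length = longest_common_clause_length(a, b)
    ratio_to_sum = length / (len(a) + len(b))
    ratio_to_shorter = length / min(len(a), len(b))
    return length, ratio_to_sum, ratio_to_shorter
```

For the running example the three parameters are 2, 0.25, and 0.5; each would be compared with its own threshold (first, fourth, and fifth). Both query statements are assumed to be non-empty.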
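Condition six differs from condition three in that, after "backstreet boys ‘the one’" and "the one backstreet boys" are converted into vectors, the distance between the two vectors is computed. If the distance is less than the sixth threshold, the two query statements can form a statement chain. The distance between the two vectors in condition six may be the Euclidean distance.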
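Conditions three and six can be sketched with a simple count-based vector space model. The vocabulary passed to `to_vector` stands in for the word feature library, and using raw occurrence counts as the dimension values is an assumption for this sketch; the text also allows statistical features or embedding methods such as Word2vec.

```python
import math
from collections import Counter


def to_vector(tokens, vocabulary):
    """Map a token list to a count vector with one dimension per vocabulary word."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def angle_between(u, v):
    """Angle in radians between two non-zero vectors (condition three)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("angle is undefined for a zero vector")
    cosine = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.acos(cosine)


def euclidean_distance(u, v):
    """Euclidean distance between two vectors (condition six)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Because the two example queries contain the same four words in a different order, a count-based model maps them to the same vector, so both the angle and the distance are zero and fall below any positive threshold.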
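Each of the six conditions produces one parameter: the length of the longest common clause, the minimum number of operations, the angle between the vectors, the ratio of the length of the longest common clause to the sum of the lengths of the two query statements, the ratio of the length of the longest common clause to the length of the shorter query statement, and the distance between the vectors. In practice, step 608 may therefore use any two or more of the six parameters in combination, for example by weighting the parameters, summing them into an overall parameter, and comparing that overall parameter with a threshold to decide whether the two query statements can form a statement chain.
A user may switch search targets while using the information retrieval system, and query statements aimed at different targets generally differ considerably, whereas the change features between two query statements aimed at the same or a similar target are more useful for identifying stop words. The six conditions above are therefore, in essence, ways of determining that the first query statement and the second query statement differ only slightly, so as to obtain a first query statement and a second query statement with the same or a similar search target.
In step 608, if the two query statements are determined to form a statement chain, the result may be stored in the historical query statements, for example in the format statement chain 1: query statement A, query statement B; statement chain 2: query statement C, query statement D. If the information retrieval system receives the same query statement again, it can read the stored result directly instead of repeating the determination of step 608. Statement chain 2 itself contains only query statement C and query statement D. When query statement E is received and is judged to belong to the same statement chain as query statement D, the system can create statement chain 3: query statement D, query statement E, and can also create statement chain 4: query statement C, query statement D, query statement E; statement chain 4 may also replace statement chain 2. The statement chain information stored in the information retrieval system thus becomes richer, and so do the change features between query statements extracted later.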
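Step 610: Obtain the change features of each word of the first query statement relative to the second query statement. These change features indicate the various changes of each word of the first query statement relative to the second query statement.
Optionally, the change features of each word of the first query statement relative to the second query statement include any one of the following: a first change feature, which indicates whether each word of the first query statement is a newly added word relative to the second query statement; a second change feature, which indicates, for a word contained in both query statements, the change of its position in the first query statement relative to its position in the second query statement; a third change feature, which indicates, for a word contained in both query statements, the change of its part of speech in the first query statement relative to its part of speech in the second query statement; a fourth change feature, which indicates, for a word contained in both query statements, the change of its grammatical category in the first query statement relative to its grammatical category in the second query statement; and a fifth change feature, which indicates, for a word contained in both query statements, the change of the punctuation marks at its two ends in the first query statement relative to those in the second query statement.
The first change feature indicates whether each word of the first query statement is a newly added word relative to the second query statement. A word that the first query statement adds relative to the second query statement is generally unlikely to be a stop word. For example, if the second query statement is "backstreet boys" and the first query statement is "backstreet boys the one", then "the one" is newly added in the first query statement relative to the second query statement.
Take a first query statement Query A[m][n]={{word11,m11,m12…,m1n},{word12,m21,m22…,m2n}…{word1m,mm1,mm2…,mmn}} and a second query statement Query B[x][y]={{word21,m11,m12…,m1y},{word22,m21,m22…,m2y}…{word2x,mx1,mx2…,mxy}} as an example. In Query A, word11 is a word or a punctuation mark, m11 is its position in A, m12 is its grammatical category in A, m13 is its part of speech, and m14 to m1n are other statistical features of word11 or statement features of word11 in the first query statement.
Comparing Query A with Query B shows whether each word of Query A is newly added relative to Query B, which identifies whether each word of the first query statement is a newly added word relative to the second query statement.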
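The second change feature indicates the change of the position of each word in the first query statement relative to its position in the second query statement. In general, if a word has moved in the first query statement relative to its position in the second query statement, the word is relatively important, that is, it is unlikely to be a stop word.
Compare the corresponding elements of each row of Query A and Query B. For example, if word11 is the same as word22 (word11 being a word) but the position m11 of word11 in Query A differs from the position m21 of word22 in Query B, the position of that word in the first query statement has changed relative to its position in the second query statement. The second change feature may therefore indicate whether the position has changed, or may indicate the size of the change, that is, the difference between m11 and m21.
The third change feature concerns a word contained in both query statements and indicates the change of its part of speech in the first query statement relative to its part of speech in the second query statement. Words with different parts of speech differ in how likely they are to be stop words; for example, a noun is generally less likely to be a stop word than an adjective. If different feature values are assigned to different parts of speech and the part of speech of a word in the first query statement has changed relative to the second query statement, the third change feature may be the feature value of the part of speech of the word in the first query statement minus the feature value of its part of speech in the second query statement.
Compare the corresponding elements of each row of Query A and Query B. For example, if word11 is the same as word22 (word11 being a word) but m13 differs from m23, the part of speech of that word in the first query statement has changed relative to its part of speech in the second query statement.
The fourth change feature concerns a word contained in both query statements and indicates the change of its grammatical category in the first query statement relative to its grammatical category in the second query statement. Words of different grammatical categories differ in how likely they are to be stop words. If different feature values are assigned to different grammatical categories, for example subject, predicate, and object, and the grammatical category of a word in the first query statement has changed relative to the second query statement, the fourth change feature may be the feature value of the grammatical category of the word in the first query statement minus the feature value of its grammatical category in the second query statement.
Compare the corresponding elements of each row of Query A and Query B. For example, if word11 is the same as word22 (word11 being a word) but m12 differs from m22, the grammatical category of that word in the first query statement has changed relative to its grammatical category in the second query statement.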
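The fifth change feature concerns a word contained in both query statements and indicates the change of the punctuation marks at its two ends in the first query statement relative to those in the second query statement. A word is more or less likely to be a stop word depending on the punctuation marks at its two ends. For example, if a word in the first query statement has gained spaces at its two ends relative to the second query statement, or is enclosed in double or single quotation marks, the word is unlikely to be a stop word, and the fifth change feature can indicate this. For example, if the second query statement is "backstreet boys the one" and the first query statement is "backstreet boys ‘the one’", then "the one" is enclosed in quotation marks in the first query statement and is unlikely to be a stop word.
Compare the corresponding elements of each row of Query A and Query B. For example, if word13 is the same as word23 (word13 being a word) but the punctuation mark word12 before word13 and the punctuation mark word14 after word13 differ from the punctuation mark word22 before word23 and the punctuation mark word24 after word23, the punctuation marks at the two ends of word13 have changed relative to word23. It is also possible that word22 and word24 are not punctuation marks, in which case punctuation marks have been added at the two ends of word13 in the first query statement.
If step 608 determines that "backstreet boys ‘the one’" and "the one backstreet boys" can form a statement chain, the change features of "backstreet boys ‘the one’" relative to "the one backstreet boys" are obtained.
Taking "backstreet boys ‘the one’" as the first query statement and "backstreet boys the one" as the second query statement, "the" and "one" in the first query statement have two change features relative to the second query statement: the second change feature above, because the positions of "the" and "one" have changed, and the fifth change feature above, because quotation marks have been added at the two ends of "the one".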
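Three of the change features (newly added word, position change, and punctuation change) can be sketched for token lists as follows. This is an illustrative simplification under stated assumptions: the function names and dictionary keys are hypothetical, quotation marks are the only punctuation considered, a repeated word is compared with its first occurrence in the second query statement, and part of speech and grammatical category are left out because they would need a tagger.

```python
QUOTES = {"‘", "’", "“", "”", "'", '"'}


def quoted_words(tokens):
    """Return the set of words that sit between a pair of quotation marks."""
    inside, quoted = False, set()
    for token in tokens:
        if token in QUOTES:
            inside = not inside
        elif inside:
            quoted.add(token)
    return quoted


def change_features(first, second):
    """Describe each word of `first` relative to `second` (both token lists)."""
    second_position = {}
    for position, token in enumerate(second, start=1):
        if token not in QUOTES:
            second_position.setdefault(token, position)
    quoted_first, quoted_second = quoted_words(first), quoted_words(second)
    features = []
    for position, token in enumerate(first, start=1):
        if token in QUOTES:
            continue
        is_new = token not in second_position
        features.append({
            "word": token,
            "is_new": is_new,  # first change feature
            "position_shift": 0 if is_new else position - second_position[token],  # second
            "newly_quoted": token in quoted_first and token not in quoted_second,  # fifth
        })
    return features
```

Run on the tokens of "backstreet boys ‘the one’" against "the one backstreet boys", the sketch reports that "the" and "one" are not new, have moved, and are newly quoted, which are the signals the text associates with a low likelihood of being a stop word.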
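If the session that contains "backstreet boys ‘the one’" has a statement chain with more than two query statements, like statement chain 4 in step 608, step 610 can obtain not only the change features of "backstreet boys ‘the one’" relative to "the one backstreet boys" but also its change features relative to the other query statements. The obtained change features, together with the statement chain determination obtained in step 608, can be stored in the historical query statements for use when the same query statement is processed again.
Besides storing query statements as arrays as described above, an object-oriented implementation can be used. For example, the following two classes can represent a query statement and its words, where the Query class represents a query statement and the Word class represents each word of a query statement.
[Figures PCTCN2015096179-appb-000003 and PCTCN2015096179-appb-000004: definitions of the Query and Word classes; images not reproduced]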
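With query statements and words stored in these data structures, whether Query M and Query N belong to the same session can be checked by calling Query.isInTheSameSession(Query M.sessionID, Query N.sessionID), where isInTheSameSession() is defined to decide, from the session IDs of the two query statements, whether they belong to the same session.
Whether Query M and Query N belong to the same statement chain can be checked by calling Query.isInTheSameQueryChain(Query M, Query N), where isInTheSameQueryChain() is defined to decide, from the two input query statements, whether they belong to the same statement chain; that is, isInTheSameQueryChain() contains the method of step 608 for determining that the first query statement and the second query statement belong to the same statement chain.
Similarly, the first change feature above can be implemented with the function Word.newWord(). Once Query1 and Query2 are known to belong to the same statement chain, the newWord() function of each word in Query1.Word can be run to decide whether each word of the first query statement is a newly added word relative to the second query statement. The other change features are obtained in a similar way: the functions defined in Query1.Word yield the various change features of each word of Query1 relative to Query2.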
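The class definitions themselves appear only as figure images, so the following is a reconstruction sketch rather than the original definitions. It uses Python naming for the methods named in the text (isInTheSameSession, isInTheSameQueryChain, newWord); the fields, the `from_text` constructor, and the choice of condition one with a default threshold of 1 inside the chain check are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Word:
    text: str
    position: int

    def new_word(self, other_query):
        """First change feature: True if this word does not occur in other_query."""
        return all(self.text != other.text for other in other_query.words)


@dataclass
class Query:
    session_id: str
    words: list = field(default_factory=list)

    @classmethod
    def from_text(cls, session_id, text):
        """Build a Query from a space-delimited string (hypothetical helper)."""
        words = [Word(token, i) for i, token in enumerate(text.split(), start=1)]
        return cls(session_id, words)

    @staticmethod
    def is_in_the_same_session(m, n):
        return m.session_id == n.session_id

    @staticmethod
    def is_in_the_same_query_chain(m, n, first_threshold=1):
        """Step 608 reduced to condition one: longest common clause > threshold."""
        if not Query.is_in_the_same_session(m, n):
            return False
        a = [w.text for w in m.words]
        b = [w.text for w in n.words]
        best = 0
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                    k += 1
                best = max(best, k)
        return best > first_threshold
```

A call such as `[w.text for w in q1.words if w.new_word(q2)]` then lists the words of the first query statement that are new relative to the second, once the two queries are known to form a chain.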
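Step 612: Identify the stop words in the first query statement according to the change features of each word of the first query statement relative to the second query statement.
The stop words in the first query statement can be identified in several ways. As an example, a value is assigned to each change feature of each word of the first query statement, where a higher value means the word is more likely to be a stop word. If the weighted sum of the change feature values of a word is greater than a preset threshold, the word is identified as a stop word; otherwise, it is identified as a non-stop word.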
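The weighted-sum rule can be written in a few lines. The feature names, weights, and threshold in the usage note below are invented purely for illustration; the text gives no concrete values.

```python
def is_stop_word(feature_values, weights, threshold):
    """Return True if the weighted sum of the feature values exceeds the threshold.

    Higher feature values are taken to mean "more likely a stop word",
    following the convention described in the text.
    """
    score = sum(weights[name] * value for name, value in feature_values.items())
    return score > threshold
```

For instance, with hypothetical weights `{"not_new": 0.4, "unchanged_position": 0.3, "not_quoted": 0.3}` and a threshold of 0.5, a word that is not new but has moved and has been put in quotation marks scores 0.4 and is kept, while a word for which all three indicators are 1 scores 1.0 and is treated as a stop word.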
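Optionally, if step 604 is performed after step 602, that is, the statistical features of each word of the first query statement have been obtained, step 612 may identify the stop words in the first query statement according to both the statistical features of each word and the change features of each word relative to the second query statement.
Optionally, if step 604 also obtains the statement features of each word within the first query statement, step 612 may identify the stop words in the first query statement according to the statistical features, the statement features, and the change features of each word, which further improves identification accuracy.
Optionally, the information retrieval system is further provided with an identification model. The statistical features of each word of the first query statement and the change features of each word relative to the second query statement are input into the identification model to identify whether each word of the first query statement is a stop word. The identification model may be a threshold model: for example, if the weighted sum of the change feature values and statistical feature values of a word is greater than a preset threshold, the word is identified as a stop word, and otherwise as a non-stop word. The identification model may also be a learning model such as a decision tree or a neural network. In practice, besides the identification model, some direct identification rules can be configured and used alongside it to speed up identification; for example, if a word of the first query statement is newly added relative to the second query statement, the word can be identified directly as a non-stop word.
When step 604 is performed after step 602 and the first query statement cannot form a statement chain with any historical query statement, no change features of the first query statement relative to a second query statement can be obtained, and step 612 identifies the stop words in the first query statement according to the statistical features of each word alone.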
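A traditional stop word identification method based on a stop word list relies only on a manually configured stop word list or on file statistics and cannot take the change features between query statements of the same session into account. For the query statement "backstreet boys ‘the one’", for example, a stop word list would readily identify the definite article "the" as a stop word. In this example, however, "the one" is the title of a song by the group "backstreet boys", so "the" cannot simply be treated as a definite article; treating it as a stop word and ignoring what it stands for in the subsequent retrieval would harm the retrieval result. The method provided in this embodiment adds the change features between query statements to stop word identification, so that the information retrieval system can identify stop words in light of how the user changes the query statement. This avoids the error of the traditional list-based method, which cannot take into account the changes that result when the user adjusts the query statement.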
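Optionally, the method further includes step 614: using the statistical features of the stop words in the first query statement and their change features relative to the second query statement as positive samples, using the statistical features of the other words in the first query statement and their change features relative to the second query statement as negative samples, and training the identification model according to the positive samples and the negative samples.
After the stop words in the first query statement are identified, the statistical features and change features of the stop words are used as positive samples, and the statistical features and change features of the other words are used as negative samples, to train the identification model. When the identification model later performs stop word identification, classifying a received word with the positive samples means that the model identifies the word as a stop word, and classifying it with the negative samples means that the model does not identify the word as a stop word. Training improves the accuracy of the identification model.
Optionally, if step 612 identifies the stop words in the first query statement according to the statistical features, the statement features, and the change features of each word, step 614 also uses the statement features of the stop words in the first query statement as positive samples and the statement features of the other words as negative samples to train the identification model, which further improves its accuracy.
Each time the information retrieval system identifies the stop words of a query statement, it may store the resulting positive and negative samples, so that step 614 is performed after samples have accumulated for a certain period or to a certain number.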
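Collecting positive and negative samples until enough have accumulated can be sketched as below. The class name `SampleBuffer`, the representation of features as plain lists, and the default of 1000 samples are assumptions for this sketch; the actual training call depends on the chosen model and is not shown.

```python
class SampleBuffer:
    """Hypothetical store that accumulates training samples for step 614."""

    def __init__(self, min_samples=1000):
        self.min_samples = min_samples
        self.positives = []  # feature lists of words identified as stop words
        self.negatives = []  # feature lists of all other words

    def add_query(self, word_features, stop_words):
        """Split the per-word features of one query into positive and negative samples."""
        for word, features in word_features.items():
            target = self.positives if word in stop_words else self.negatives
            target.append(features)

    def ready_for_training(self):
        """True once enough samples have accumulated to run the training step."""
        return len(self.positives) + len(self.negatives) >= self.min_samples
```

A caller would add the features of every processed query and start training only when `ready_for_training()` returns True, which corresponds to accumulating samples "to a certain number" in the text.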
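Optionally, before step 614, the stop words identified by the identification model are removed from the first query statement to obtain candidate search terms, and retrieval is performed according to the candidate search terms to obtain a retrieval result; the training is performed when the retrieval result is determined to be correct.
Retrieval is performed according to the candidate search terms, and the retrieval-process features and the retrieval result are stored in the historical query log. The retrieval-process features include the statistical features of each word of the first query statement, the change features of each word relative to the second query statement, and the statement features of each word within the first query statement. After step 612, the user can already obtain the retrieval result based on "backstreet boys ‘the one’". The retrieval result includes the IDs of the files retrieved according to the query statement; Table 2 shows a reference storage format for the historical query log. After obtaining the file IDs, the information retrieval system returns the corresponding files to the user, and the operations that the user performs on these files are also recorded in the historical query log. The operation information describes what the user does with each file obtained by the query, for example which files the user clicks, when each click occurs, and how long the user browses each file. The interval between the moment one file is clicked and the moment the next file is clicked is generally regarded as the browsing time of the former file.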
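Query statement 1: retrieval-process features, retrieval result, operation information
Query statement 2: retrieval-process features, retrieval result, operation information
Query statement n: retrieval-process features, retrieval result, operation information
Table 2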
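Determining the correctness of the retrieval result of the first query statement means analyzing the operation information corresponding to the first query statement to determine how satisfied the user is with the retrieval result. For example, if the user clicks one file in the retrieval result of the first query statement and does not click any other file within 60 seconds, the user can be considered to have found the required file in this retrieval result. The words of the first query statement identified as stop words, together with their retrieval-process features, can then serve as positive samples for the identification model, and the words not identified as stop words, together with their retrieval-process features, can serve as negative samples, for training the identification model.
The screening condition for determining the correctness of the retrieval result of the first query statement can be set in several ways. Besides the condition that the time without further action after the user clicks a file exceeds a threshold, the condition may be that the number of files the user clicks among the retrieved files exceeds a threshold, and so on. Any screening condition or event is suitable as long as it indicates that the user accepts the files in this retrieval result as accurate.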
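The two screening conditions mentioned above (a long enough browsing time on some file, or more clicks than a threshold) can be sketched as a single check. The function and parameter names, the use of numeric timestamps in seconds, and the `observed_until` cut-off that bounds the browsing time of the last clicked file are assumptions made for this sketch.

```python
def result_seems_correct(click_times, observed_until,
                         dwell_threshold=60.0, click_count_threshold=None):
    """Heuristically decide whether the user was satisfied with a retrieval result.

    click_times: ascending timestamps (seconds) of the user's clicks on result files.
    observed_until: time up to which the log covers this query.
    """
    if click_count_threshold is not None and len(click_times) > click_count_threshold:
        return True
    # browsing time of a file = time until the next click (or until the log ends)
    next_events = list(click_times[1:]) + [observed_until]
    return any(later - earlier >= dwell_threshold
               for earlier, later in zip(click_times, next_events))
```

Only queries for which this check returns True would contribute their words and retrieval-process features as training samples.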
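Training the identification model generally requires a certain number of accumulated positive and negative samples. Therefore, once a certain number of query statements have accumulated in the historical query log, or after a preset time, the operation information on the retrieval results of the query statements in the historical query log is analyzed to pick out the retrieval-process features of the query statements that are suitable as training data, and the identification model is trained with them.
While using the information retrieval system, the operations a user performs on the files in a retrieval result reflect the user's judgment of the correctness of that result, and also reflect whether the corresponding stop word identification was accurate. Analyzing the historical query log of each query statement shows which query statements produced retrieval results that satisfied the user, and the stop word identification parameters and results of those query statements can be used to train the identification model. Feeding the user's operations back into the identification model used for stop word identification makes the information retrieval system adapt better to the user's environment and habits. This is especially useful for information retrieval systems in special scenarios. In an information retrieval system used in a supermarket, for example, the word "merchandise" is used frequently and probably carries no special meaning. When the system treats "merchandise" as a stop word in retrieval, the user is likely to accept the retrieval result as accurate, so the retrieval-process features of "merchandise" can be used as a positive sample to train the identification model.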
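Step 616: Update the word feature library of the information retrieval system according to new files.
The files that the information retrieval system can retrieve are updated at regular intervals, so after the words in the new files are analyzed, the word feature library can be updated to improve the stop word identification accuracy of the system. Step 616 may be performed independently of steps 602 to 614, that is, the word feature library can be updated and the identification model trained in parallel. Steps 614 and 616 may be performed online (as users enter query statements) or offline (for example when the system is idle, under centralized maintenance, or being updated). Because both steps need the historical query log or the files to accumulate updates over some period, performing them offline avoids the processing load that online execution would place on the information retrieval system.
In the stop word identification method provided in this embodiment, a query statement that belongs to the same session as the query statement to be processed is obtained, the change features of the words between the two query statements are extracted, and these change features are taken into account when identifying stop words in the query statement to be processed. Stop word identification can therefore draw on the changes between query statements, which improves identification accuracy and thus the accuracy of the retrieval results output by the information retrieval system.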
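An embodiment of the present invention further provides a stop word identification apparatus 800. The apparatus 800 may be implemented by the retrieval device 202 shown in FIG. 1 or FIG. 2, by the computing device 400 shown in FIG. 3, by an application-specific integrated circuit (ASIC), or by a programmable logic device (PLD). The PLD may be a complex programmable logic device (CPLD), an FPGA, a generic array logic (GAL) device, or any combination thereof. The stop word identification apparatus 800 is configured to implement the stop word identification method shown in FIG. 4.
As shown in FIG. 5, the stop word identification apparatus 800 includes an input module 802 and a processing module 804. When working, the processing module 804 performs steps 604 to 616 of the stop word identification method shown in FIG. 4.
The input module 802 is configured to receive a first query statement and obtain the session ID corresponding to the first query statement, that is, to perform step 602 of the stop word identification method shown in FIG. 4.
The processing module 804 is configured to: obtain, according to the session ID corresponding to the first query statement, a second query statement that belongs to the same session as the first query statement; obtain the change features of each word of the first query statement relative to the second query statement; and identify the stop words in the first query statement according to these change features.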
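The second query statement obtained by the processing module 804 can form a statement chain with the first query statement. The conditions for forming a statement chain include: the length of the longest common clause of the first query statement and the second query statement is greater than a first threshold; or the minimum number of operations required to convert the first query statement into the second query statement is less than a second threshold; or, with the first query statement mapped to a first vector and the second query statement mapped to a second vector, the angle between the first vector and the second vector is less than a third threshold; or the ratio of the length of the longest common clause of the two query statements to the sum of the lengths of the two query statements is greater than a fourth threshold; or the ratio of the length of the longest common clause of the two query statements to the length of the shorter of the two query statements is greater than a fifth threshold; or the distance between the first vector and the second vector is less than a sixth threshold.
The processing module 804 further queries the word feature library of the information retrieval system according to each word of the first query statement to obtain the statistical features of each word, inputs the change features of each word relative to the second query statement and the statistical features of each word into the identification model, and obtains the stop words in the first query statement identified by the identification model. The identification model is generally a piece of code that the processing module 804 calls when it trains the identification model or uses it to identify stop words.
The processing module 804 further removes the stop words identified by the identification model from the first query statement to obtain candidate search terms and performs retrieval according to the candidate search terms to obtain a retrieval result. When the retrieval result is determined to be correct, the processing module 804 uses the statistical features of the stop words in the first query statement and their change features relative to the second query statement as positive samples, uses the statistical features of the other words in the first query statement and their change features relative to the second query statement as negative samples, and trains the identification model according to the positive samples and the negative samples. Determining the correctness of the retrieval result of the first query statement means analyzing the operation information corresponding to the first query statement to determine how satisfied the user is with the retrieval result. For query statements whose retrieval results satisfied the user, the features of the words identified as stop words and as non-stop words are used to train the identification model, which further improves its accuracy.
The stop word identification apparatus provided in this embodiment obtains a query statement that belongs to the same session as the query statement to be processed, extracts the change features of the words between the two query statements, and takes these change features into account when identifying stop words in the query statement to be processed. Stop word identification can therefore draw on the changes between query statements, which improves identification accuracy and thus the accuracy of the retrieval results output by the information retrieval system.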
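In the foregoing embodiments, the description of each embodiment has its own emphasis. For a part that is not described in detail in one embodiment, refer to the related description in the other embodiments.
The method described in this disclosure may be implemented by a processor executing software instructions. The software instructions may consist of corresponding software modules, which may be stored in a RAM, a flash memory, a ROM, an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a hard disk, an optical disc, or any other form of storage medium well known in the art.
A person skilled in the art will appreciate that, in one or more of the foregoing examples, the functions described in the present invention may be implemented in hardware or software. When implemented in software, the functions may be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium. The storage medium may be any available medium that a general-purpose or special-purpose computer can access.
The specific implementations above describe the objectives, technical solutions, and beneficial effects of the present invention in further detail. They are merely specific implementations of the present invention and are not intended to limit its protection scope. Any modification or improvement made on the basis of the technical solutions of the present invention shall fall within the protection scope of the present invention.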

Claims (21)

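  1. A stop word identification method, performed by an information retrieval system running on a computer device, comprising:
    receiving a first query statement, and obtaining a session identifier (ID) corresponding to the first query statement;
    obtaining, according to the session ID, a second query statement that belongs to the same session as the first query statement;
    obtaining change features of each word of the first query statement relative to the second query statement; and
    identifying stop words in the first query statement according to the change features of each word of the first query statement relative to the second query statement.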
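  2. The method according to claim 1, wherein the change features of each word of the first query statement relative to the second query statement comprise any one of the following: a first change feature, used to indicate whether each word of the first query statement is a newly added word relative to the second query statement; a second change feature, used to indicate, for a word contained in both the first query statement and the second query statement, a change of its position in the first query statement relative to its position in the second query statement; a third change feature, used to indicate, for a word contained in both the first query statement and the second query statement, a change of its part of speech in the first query statement relative to its part of speech in the second query statement; a fourth change feature, used to indicate, for a word contained in both the first query statement and the second query statement, a change of its grammatical category in the first query statement relative to its grammatical category in the second query statement; and a fifth change feature, used to indicate, for a word contained in both the first query statement and the second query statement, a change of the punctuation marks at its two ends in the first query statement relative to the punctuation marks at its two ends in the second query statement.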
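  3. The method according to claim 1 or 2, wherein the obtained second query statement meets any one of the following conditions:
    the length of the longest common clause of the first query statement and the second query statement is greater than a first threshold; or
    the minimum number of operations required to convert the first query statement into the second query statement is less than a second threshold; or
    with the first query statement mapped to a first vector and the second query statement mapped to a second vector, the angle or distance between the first vector and the second vector is less than a third threshold.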
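  4. The method according to any one of claims 1 to 3, wherein the method further comprises:
    querying a word feature library of the information retrieval system according to each word of the first query statement, to obtain statistical features of each word of the first query statement;
    wherein the identifying stop words in the first query statement according to the change features of each word of the first query statement relative to the second query statement comprises: identifying the stop words in the first query statement according to the statistical features of each word of the first query statement and the change features of each word of the first query statement relative to the second query statement.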
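  5. The method according to claim 4, wherein the information retrieval system is further provided with an identification model;
    the identifying the stop words in the first query statement according to the statistical features of each word of the first query statement and the change features of each word of the first query statement relative to the second query statement comprises:
    inputting the change features of each word of the first query statement relative to the second query statement and the statistical features of each word of the first query statement into the identification model, and obtaining the stop words in the first query statement identified by the identification model.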
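  6. The method according to claim 5, wherein the method further comprises:
    using the statistical features of the stop words in the first query statement and the change features of the stop words in the first query statement relative to the second query statement as positive samples, using the statistical features of the words other than the stop words in the first query statement and the change features of those other words relative to the second query statement as negative samples, and training the identification model according to the positive samples and the negative samples.
  7. The method according to claim 6, wherein before the training is performed, the method further comprises:
    removing the stop words identified by the identification model from the first query statement to obtain candidate search terms, and performing retrieval according to the candidate search terms to obtain a retrieval result; and
    performing the training when the correctness of the retrieval result is determined.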
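  8. A stop word identification apparatus, comprising:
    an input module, configured to receive a first query statement and obtain a session identifier (ID) corresponding to the first query statement; and
    a processing module, configured to obtain, according to the session ID, a second query statement that belongs to the same session as the first query statement, and further configured to obtain change features of each word of the first query statement relative to the second query statement;
    wherein the processing module is further configured to identify stop words in the first query statement according to the change features of each word of the first query statement relative to the second query statement.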
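  9. The apparatus according to claim 8, wherein the change features of each word of the first query statement relative to the second query statement comprise any one of the following: a first change feature, used to indicate whether each word of the first query statement is a newly added word relative to the second query statement; a second change feature, used to indicate, for a word contained in both the first query statement and the second query statement, a change of its position in the first query statement relative to its position in the second query statement; a third change feature, used to indicate, for a word contained in both the first query statement and the second query statement, a change of its part of speech in the first query statement relative to its part of speech in the second query statement; a fourth change feature, used to indicate, for a word contained in both the first query statement and the second query statement, a change of its grammatical category in the first query statement relative to its grammatical category in the second query statement; and a fifth change feature, used to indicate, for a word contained in both the first query statement and the second query statement, a change of the punctuation marks at its two ends in the first query statement relative to the punctuation marks at its two ends in the second query statement.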
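  10. The apparatus according to claim 8 or 9, wherein the obtained second query statement meets any one of the following conditions:
    the length of the longest common clause of the first query statement and the second query statement is greater than a first threshold; or
    the minimum number of operations required to convert the first query statement into the second query statement is less than a second threshold; or
    with the first query statement mapped to a first vector and the second query statement mapped to a second vector, the angle or distance between the first vector and the second vector is less than a third threshold.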
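  11. The apparatus according to any one of claims 8 to 10, wherein the processing module is further configured to query a word feature library of the information retrieval system according to each word of the first query statement, to obtain statistical features of each word of the first query statement;
    wherein the processing module identifying the stop words in the first query statement according to the change features of each word of the first query statement relative to the second query statement comprises: identifying the stop words in the first query statement according to the statistical features of each word of the first query statement and the change features of each word of the first query statement relative to the second query statement.
  12. The apparatus according to claim 11, wherein the processing module further comprises an identification model;
    the processing module identifying the stop words in the first query statement according to the statistical features of each word of the first query statement and the change features of each word of the first query statement relative to the second query statement comprises: inputting the change features of each word of the first query statement relative to the second query statement and the statistical features of each word of the first query statement into the identification model, and obtaining the stop words in the first query statement identified by the identification model.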
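  13. The apparatus according to claim 12, wherein the processing module is further configured to use the statistical features of the stop words in the first query statement and the change features of the stop words in the first query statement relative to the second query statement as positive samples, use the statistical features of the words other than the stop words in the first query statement and the change features of those other words relative to the second query statement as negative samples, and train the identification model according to the positive samples and the negative samples.
  14. The apparatus according to claim 13, wherein before the training is performed, the processing module is further configured to:
    remove the stop words identified by the identification model from the first query statement to obtain candidate search terms, and perform retrieval according to the candidate search terms to obtain a retrieval result; and
    perform the training when the correctness of the retrieval result is determined.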
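  15. A computing device, comprising a processor and a memory;
    wherein the processor is configured to read a program in the memory to perform the following operations: receiving a first query statement, and obtaining a session identifier (ID) corresponding to the first query statement; obtaining, according to the session ID, a second query statement that belongs to the same session as the first query statement; obtaining change features of each word of the first query statement relative to the second query statement; and identifying stop words in the first query statement according to the change features of each word of the first query statement relative to the second query statement.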
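  16. The computing device according to claim 15, wherein the change features of each word of the first query statement relative to the second query statement comprise any one of the following: a first change feature, used to indicate whether each word of the first query statement is a newly added word relative to the second query statement; a second change feature, used to indicate, for a word contained in both the first query statement and the second query statement, a change of its position in the first query statement relative to its position in the second query statement; a third change feature, used to indicate, for a word contained in both the first query statement and the second query statement, a change of its part of speech in the first query statement relative to its part of speech in the second query statement; a fourth change feature, used to indicate, for a word contained in both the first query statement and the second query statement, a change of its grammatical category in the first query statement relative to its grammatical category in the second query statement; and a fifth change feature, used to indicate, for a word contained in both the first query statement and the second query statement, a change of the punctuation marks at its two ends in the first query statement relative to the punctuation marks at its two ends in the second query statement.
  17. The computing device according to claim 15 or 16, wherein the second query statement obtained by the processor meets any one of the following conditions: the length of the longest common clause of the first query statement and the second query statement is greater than a first threshold; or the minimum number of operations required to convert the first query statement into the second query statement is less than a second threshold; or, with the first query statement mapped to a first vector and the second query statement mapped to a second vector, the angle or distance between the first vector and the second vector is less than a third threshold.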
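  18. The computing device according to any one of claims 15 to 17, wherein the processor is further configured to query a word feature library of the information retrieval system according to each word of the first query statement, to obtain statistical features of each word of the first query statement;
    wherein the processor identifying the stop words in the first query statement according to the change features of each word of the first query statement relative to the second query statement comprises: identifying the stop words in the first query statement according to the statistical features of each word of the first query statement and the change features of each word of the first query statement relative to the second query statement.
  19. The computing device according to claim 18, wherein the processor identifying the stop words in the first query statement according to the statistical features of each word of the first query statement and the change features of each word of the first query statement relative to the second query statement comprises: inputting the change features of each word of the first query statement relative to the second query statement and the statistical features of each word of the first query statement into the identification model, and obtaining the stop words in the first query statement identified by the identification model.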
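  20. The computing device according to claim 19, wherein the processor is further configured to use the statistical features of the stop words in the first query statement and the change features of the stop words in the first query statement relative to the second query statement as positive samples, use the statistical features of the words other than the stop words in the first query statement and the change features of those other words relative to the second query statement as negative samples, and train the identification model according to the positive samples and the negative samples.
  21. The computing device according to claim 20, wherein before performing the training, the processor is further configured to remove the stop words identified by the identification model from the first query statement to obtain candidate search terms, and perform retrieval according to the candidate search terms to obtain a retrieval result; and perform the training when the correctness of the retrieval result is determined.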
PCT/CN2015/096179 2015-12-01 2015-12-01 Stop word identification method and apparatus WO2017091985A1 (zh)

Priority Applications (5)

Application Number Priority Date Filing Date Title
CN201580029727.5A CN108027814B (zh) 2015-12-01 2015-12-01 Stop word identification method and apparatus
PCT/CN2015/096179 WO2017091985A1 (zh) 2015-12-01 2015-12-01 Stop word identification method and apparatus
JP2017521535A JP6355840B2 (ja) 2015-12-01 2015-12-01 Stop word identification method and apparatus
EP15909502.5A EP3232336A4 (en) 2015-12-01 2015-12-01 Method and device for recognizing stop word
US15/693,971 US10019492B2 (en) 2015-12-01 2017-09-01 Stop word identification method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2015/096179 WO2017091985A1 (zh) 2015-12-01 2015-12-01 Stop word identification method and apparatus

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/693,971 Continuation US10019492B2 (en) 2015-12-01 2017-09-01 Stop word identification method and apparatus

Publications (1)

Publication Number Publication Date
WO2017091985A1 true WO2017091985A1 (zh) 2017-06-08

Family

ID=58796113

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/096179 WO2017091985A1 (zh) 2015-12-01 2015-12-01 Stop word identification method and apparatus

Country Status (5)

Country Link
US (1) US10019492B2 (zh)
EP (1) EP3232336A4 (zh)
JP (1) JP6355840B2 (zh)
CN (1) CN108027814B (zh)
WO (1) WO2017091985A1 (zh)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
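CN108491462B (zh) * 2018-03-05 2021-09-14 昆明理工大学 Semantic query expansion method and apparatus based on word2vec
CN109947803B (zh) * 2019-03-12 2021-11-19 成都全景智能科技有限公司 Data processing method, system, and storage medium
CN110765239B (zh) * 2019-10-29 2023-03-28 腾讯科技(深圳)有限公司 Hot word identification method, apparatus, and storage medium
CN111159526B (zh) * 2019-12-26 2023-04-07 腾讯科技(深圳)有限公司 Query statement processing method, apparatus, device, and storage medium
CN111191450B (zh) * 2019-12-27 2023-12-01 深圳市优必选科技股份有限公司 Corpus cleaning method, corpus entry device, and computer-readable storage medium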
EP3901875A1 (en) 2020-04-21 2021-10-27 Bayer Aktiengesellschaft Topic modelling of short medical inquiries
CN114519090B (zh) * 2020-11-20 2023-11-21 马上消费金融股份有限公司 Stop word management method, apparatus, and electronic device
EP4036933A1 (de) 2021-02-01 2022-08-03 Bayer AG Classification of communications about medicinal products
US11914664B2 (en) 2022-02-08 2024-02-27 International Business Machines Corporation Accessing content on a web page

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080141278A1 (en) * 2006-12-07 2008-06-12 Sybase 365, Inc. System and Method for Enhanced Spam Detection
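CN102567371A (zh) * 2010-12-27 2012-07-11 上海杉达学院 Method for automatically filtering stop words
CN103455535A (zh) * 2013-05-08 2013-12-18 深圳市明唐通信有限公司 Method for building a knowledge base from historical consultation data
CN103914445A (zh) * 2014-03-05 2014-07-09 中国人民解放军装甲兵工程学院 Data semantic processing method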

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4073989B2 (ja) * 1997-12-09 2008-04-09 株式会社東芝 Natural language search input device
US6252988B1 (en) * 1998-07-09 2001-06-26 Lucent Technologies Inc. Method and apparatus for character recognition using stop words
US6514140B1 (en) * 1999-06-17 2003-02-04 Cias, Inc. System for machine reading and processing information from gaming chips
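JP2001325104A (ja) * 2000-05-12 2001-11-22 Mitsubishi Electric Corp Language case inference method, language case inference apparatus, and recording medium on which a language case inference program is recorded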
US7409383B1 (en) 2004-03-31 2008-08-05 Google Inc. Locating meaningful stopwords or stop-phrases in keyword-based retrieval systems
US8438142B2 (en) * 2005-05-04 2013-05-07 Google Inc. Suggesting and refining user input based on original user input
US9110975B1 (en) * 2006-11-02 2015-08-18 Google Inc. Search result inputs using variant generalized queries
US8498980B2 (en) * 2007-02-06 2013-07-30 Nancy P. Cochran Cherry picking search terms
US8352469B2 (en) * 2009-07-02 2013-01-08 Battelle Memorial Institute Automatic generation of stop word lists for information retrieval and analysis
US8131735B2 (en) * 2009-07-02 2012-03-06 Battelle Memorial Institute Rapid automatic keyword extraction for information retrieval and analysis
US8688727B1 (en) * 2010-04-26 2014-04-01 Google Inc. Generating query refinements
US9009144B1 (en) * 2012-02-23 2015-04-14 Google Inc. Dynamically identifying and removing potential stopwords from a local search query
CN103902552B (zh) * 2012-12-25 2019-03-26 深圳市世纪光速信息技术有限公司 Stop word mining method and apparatus, search method and apparatus, and evaluation method and apparatus
CA2899314C (en) * 2013-02-14 2018-11-27 24/7 Customer, Inc. Categorization of user interactions into predefined hierarchical categories

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JINGBIN: "Chinese Spoken Document Retrieval Method Based on Stop-word Processing", CHINESE MASTER'S THESES FULL-TEXT DATABASE, 15 July 2012 (2012-07-15), XP009502802, ISSN: 1674-0246 *

Also Published As

Publication number Publication date
CN108027814A (zh) 2018-05-11
EP3232336A1 (en) 2017-10-18
US10019492B2 (en) 2018-07-10
EP3232336A4 (en) 2018-03-21
CN108027814B (zh) 2020-06-16
JP2018501540A (ja) 2018-01-18
JP6355840B2 (ja) 2018-07-11
US20180004815A1 (en) 2018-01-04

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2017521535

Country of ref document: JP

Kind code of ref document: A

REEP Request for entry into the european phase

Ref document number: 2015909502

Country of ref document: EP

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15909502

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE