WO2018033030A1 - Method and apparatus for generating a natural language sentence library - Google Patents

Method and apparatus for generating a natural language sentence library (自然语言文句库的生成方法及装置)

Info

Publication number
WO2018033030A1
Authority
WO
WIPO (PCT)
Prior art keywords
natural language
statement
probability
language sentence
character
Prior art date
Application number
PCT/CN2017/097213
Other languages
English (en)
French (fr)
Inventor
牛国扬
陈虹
温海娇
许慢
Original Assignee
中兴通讯股份有限公司 (ZTE Corporation)
Priority date
Filing date
Publication date
Application filed by 中兴通讯股份有限公司 (ZTE Corporation)
Priority to EP17841003.1A (patent family member EP3508990A4)
Publication of WO2018033030A1

Classifications

    • G06F 16/3329 — Information retrieval of unstructured textual data: natural language query formulation or dialogue systems
    • G06F 16/35 — Information retrieval of unstructured textual data: clustering; classification
    • G06N 3/044 — Neural networks: recurrent networks, e.g. Hopfield networks
    • G06N 3/084 — Neural network learning methods: backpropagation, e.g. using gradient descent
    • G06F 40/232 — Natural language analysis: orthographic correction, e.g. spell checking or vowelisation
    • G06F 40/274 — Natural language analysis: converting codes to words; guess-ahead of partial word inputs

Definitions

  • Embodiments of the present invention relate to the field of computers, and in particular, to a method and an apparatus for generating a natural language sentence library.
  • AI: artificial intelligence; NLP: natural language processing.
  • Embodiments of the invention provide a method and an apparatus for generating a natural language sentence library, so as to at least solve the problem that constructing such a library in the related art requires extensive manual intervention and a complicated procedure.
  • a method for generating a natural language sentence library, including: acquiring character information from a training data set; converting the character information into a test set to be recognized, using character vectors of a preset dimension; and training the test set to be recognized in a recurrent neural network (RNN) model to generate the natural language sentence library.
  • acquiring the character information from the training data set includes: counting the frequency of occurrence of each character in the training data set, where a character is at least one of the following: a written character, a digit, or a symbol; and sorting the characters whose frequency exceeds a preset threshold in a preset order to obtain the character information.
  • the RNN model includes: an input layer, a hidden layer, and an output layer, where the input layer is adjacent to the hidden layer and the hidden layer is adjacent to the output layer.
  • generating the natural language sentence library includes: extracting, from the RNN model parameters configured for the RNN model, the number of hidden layers, the number of neurons in each hidden layer, and the interception length of the training data; calculating the number of input-layer neurons from the interception length and the preset dimension; setting the number of output-layer neurons according to the number of characters in the character information; and training each character in the test set to be recognized accordingly.
  • after the library is generated, it can be used either to verify whether a currently received sentence is abnormal, or to predict the characters that follow the currently received sentence.
  • using the natural language sentence library to verify whether the currently received sentence is abnormal includes: determining the number of characters in the sentence and its verification direction; calculating, in the library and along that direction, the probability of each character in the sentence; and calculating, from those per-character probabilities, the probability that the sentence is a normal sentence.
  • using the natural language sentence library to predict the characters that follow the currently received sentence includes: determining the number of characters in the sentence, its verification direction, and the number of candidate characters to predict; calculating, in the library and along that direction, the probability of each character in the sentence; and calculating the occurrence probability of each candidate character from the per-character probabilities and the number of candidates.
  • an apparatus for generating a natural language sentence library, including: an obtaining module configured to acquire character information and character vectors from the training data set; a conversion module configured to convert the character information into a test set to be recognized, using character vectors of a preset dimension; and a generating module configured to generate the natural language sentence library by training the test set to be recognized in the recurrent neural network (RNN) model.
  • the obtaining module includes: a statistics unit configured to count the frequency of occurrence of each character in the training data set, where a character is at least one of the following: a written character, a digit, or a symbol; and a first obtaining unit configured to sort the characters whose frequency exceeds a preset threshold in a preset order to obtain the character information.
  • the RNN model includes: an input layer, a hidden layer, and an output layer, wherein the input layer is adjacent to the hidden layer, and the hidden layer is adjacent to the output layer.
  • the generating module includes: an extracting unit configured to extract, from the RNN model parameters configured for the RNN model, the number of hidden layers, the number of neurons in each hidden layer, and the interception length of the training data; a first calculating unit configured to calculate the number of input-layer neurons from the interception length and the preset dimension; a setting unit configured to set the number of output-layer neurons according to the number of characters in the character information; and a generating unit configured to train each character in the test set to be recognized accordingly and generate the natural language sentence library.
  • the apparatus further includes a processing module configured to verify, using the natural language sentence library, whether the currently received sentence is abnormal, or to predict, using the library, the characters that follow the currently received sentence.
  • the processing module includes: a determining unit configured to determine the number of characters in the currently received sentence and its verification direction; a second calculating unit configured to calculate, in the library and along that direction, the probability of each character in the sentence; and a third calculating unit configured to calculate, from the per-character probabilities, the probability that the sentence is a normal sentence.
  • alternatively, the processing module includes: a determining unit configured to determine the number of characters in the currently received sentence, its verification direction, and the number of candidate characters to predict; a second calculating unit configured to calculate, in the library and along that direction, the probability of each character in the sentence; and a third calculating unit configured to calculate the occurrence probability of each candidate character from the per-character probabilities and the number of candidates.
  • a storage medium comprising a stored program, where the program, when run, performs any of the methods described above.
  • by acquiring character information from the training data set, converting it into a test set to be recognized using character vectors of a preset dimension, and training the test set in the RNN model to generate the natural language sentence library, the embodiments solve the problem that constructing such a library in the related art requires extensive manual intervention and a complicated procedure, thereby achieving a high recognition rate and ease of use, and satisfying the NLP business needs of question answering systems, retrieval systems, expert systems, online customer service, mobile assistants, personal assistants, and the like.
  • FIG. 1 is a flowchart of a method for generating a natural language sentence library according to an embodiment of the present invention;
  • FIG. 2 is a schematic diagram of a spelling error correction process using the natural language sentence library according to a preferred embodiment of the present invention;
  • FIG. 3 is a schematic diagram of an input association process using the natural language sentence library according to a preferred embodiment of the present invention;
  • FIG. 4 is a schematic diagram of a sentence judgment process using the natural language sentence library according to a preferred embodiment of the present invention;
  • FIG. 5 is a schematic diagram of a dialog generation process using the natural language sentence library according to a preferred embodiment of the present invention;
  • FIG. 6 is a block diagram of an apparatus for generating a natural language sentence library according to an embodiment of the present invention;
  • FIG. 7 is a block diagram of an apparatus for generating a natural language sentence library according to a preferred embodiment of the present invention.
  • FIG. 1 is a flowchart of a method for generating a natural language sentence library according to an embodiment of the present invention. As shown in FIG. 1, the flow includes the following steps:
  • Step S12: acquire character information from the training data set;
  • Step S14: convert the character information into a test set to be recognized, using character vectors of a preset dimension;
  • Step S16: train the test set to be recognized in a Recurrent Neural Network (RNN) model to generate the natural language sentence library.
  • through the above steps, a method for constructing a natural language sentence library is provided. It is mainly aimed at question answering systems, retrieval systems, expert systems, online customer service, mobile assistants, personal assistants, and the like, and processes Chinese natural language; it is especially suitable for the fields of natural language processing (NLP), artificial intelligence (AI), intelligent question answering, and text mining.
  • the natural language sentence library provided by the embodiments is implemented on a deep-learning RNN model. It makes full use of the large amount of information contained in context, summarizing the relationships between words, between words and sentences, between words and characters, between characters, and between words and their context, and creates basic natural-language knowledge data for supporting a variety of NLP tasks. That is, character information and character vectors are acquired from the training data set, and the training data set, character information, character vectors, and pre-configured RNN model parameters are trained in the RNN model to generate the natural language sentence library. This solves the problem that constructing such a library in the related art requires extensive manual intervention and a complicated procedure, and achieves an approach with little manual intervention that is simple, fast, easy to implement, and highly accurate.
  • the RNN model may include: an input layer, a hidden layer, and an output layer, where the input layer is adjacent to the hidden layer and the hidden layer is adjacent to the output layer.
  • pre-configured RNN model parameters may include, but are not limited to: the number of input-layer neurons I; the number of hidden layers and the number of neurons in each hidden layer; the number of output-layer neurons K; and the interception length W of the training data.
  • the selected news articles may be split into paragraphs, one paragraph per line, or left unsplit with one article per line; the result is stored in the text file data.txt, in a form such as:
  • 张颖不但自称中央电视台记者,还恐吓他要曝光,并踢了他几脚。 (Zhang Ying not only claimed to be a CCTV reporter, but also threatened to expose him and kicked him several times.)
  • step S12, acquiring the character information from the training data set, may include the following steps:
  • Step S121: count the frequency of occurrence of each character in the training data set, where a character is at least one of the following: a written character, a digit, or a symbol;
  • Step S122: sort the characters whose frequency exceeds a preset threshold in a preset order to obtain the character information.
  • the number in front is not a count of characters but the serial number of each character; the numbers are assigned in ascending order starting from 0, and once assigned, the number corresponding to each character does not change during training and testing.
  • the reason for choosing character vectors rather than word vectors is that the number of characters is relatively small and stable, whereas the number of words is large and new words keep emerging. In particular, when the library is used for spelling correction, the sentence itself may contain typos, which would hurt word-segmentation accuracy; the preferred embodiment therefore uses character vectors instead of word vectors.
  • step S16, generating the natural language sentence library by training the test set to be recognized in the RNN model, may include the following steps:
  • Step S161: extract, from the RNN model parameters configured for the RNN model, the number of hidden layers, the number of neurons in each hidden layer, and the interception length of the training data;
  • Step S162: calculate the number of input-layer neurons from the interception length of the training data and the preset dimension;
  • Step S163: set the number of output-layer neurons according to the number of characters in the character information;
  • Step S164: train each character in the test set to be recognized according to the number of hidden layers, the number of neurons in each hidden layer, the number of input-layer neurons, and the number of output-layer neurons, to generate the natural language sentence library.
  • training yields N weight matrices, i.e., the sentence library under the RNN model; the weight matrices are then saved to a binary file weight.rnn for subsequent natural language processing.
  • to improve the accuracy of the library, forward training and reverse training are performed separately, yielding two sets of weight coefficients.
  • the embodiment of the present invention is a sentence library implemented on an RNN recurrent neural network, i.e., a sentence library based on deep learning (DL).
  • according to the specific structure of the RNN, the model can be determined to include the following parts: an input layer, one or more hidden layers, and an output layer.
  • the number of output-layer neurons K can be kept consistent with the number of characters in the character information.
  • the purpose of the training process is to obtain N weight matrices, where the value of N is related to the number of hidden layers of the RNN;
  • assume one hidden layer with 200 neurons, one input layer with 50 neurons, and one output layer with 5000 neurons; denoting the input layer by i, the hidden layer by h, and the output layer by k, three weight matrices are obtained:
  • Wih, the input-to-hidden weight matrix; Whk, the hidden-to-output weight matrix; Whh, the hidden-to-hidden weight matrix.
  • since bidirectional training is used, the number of weight matrices finally obtained is 2N.
  • the character information is used when the RNN computes the error; the character vectors are used when the characters of the training data set are converted into numeric information recognizable by the computer.
  • in the hidden-layer formula, θh is the activation function, I is the number of input nodes, and H is the number of hidden-layer neurons; in the output-layer formula, θk is the softmax function, H is the number of hidden-layer neurons, and K is the number of output-layer neurons.
  • the purpose of repeated training is to obtain N pairs of weight matrices, i.e., the W weight file.
  • the format of the weight file is: P-Q,X-Y,W, where P is the serial number of the upper network layer, Q is the serial number of the lower network layer, X is the serial number of the upper-layer neuron, Y is the serial number of the lower-layer neuron, and W is the weight of the connection between the two neurons in the RNN model.
  • after step S16, in which the test set to be recognized is trained in the RNN model to generate the natural language sentence library, one of the following steps may also be performed:
  • Step S17: use the natural language sentence library to verify whether the currently received sentence is abnormal;
  • Step S18: use the natural language sentence library to predict the characters that follow the currently received sentence.
  • the library is used through an NLP online interface, through which natural language processing functions such as spelling correction, input association, sentence judgment, and dialog generation can be realized; the NLP online interface includes a probability interface and a prediction interface, whose request and response message formats are described below.
  • for example, the prediction interface returns fields Forecast1 through ForecastN, each carrying a predicted next character and its probability, e.g., Forecast1: 车 0.2523; Forecast2: 人 0.2323; Forecast3: 电 0.2023; ...; ForecastN: 学 0.1923.
  • step S17, using the natural language sentence library to verify whether the currently received sentence is abnormal, may include the following steps:
  • Step S171: determine the number of characters in the currently received sentence and its verification direction;
  • Step S172: calculate, in the natural language sentence library and along the verification direction, the probability of each character in the sentence;
  • Step S173: calculate, from the per-character probabilities, the probability that the sentence is a normal sentence.
  • the natural language sentence library can be used to calculate the probability of each character in a sentence entered by the user and the average probability of the whole sentence;
  • the sentence probability is calculated through the probability interface of the NLP online interface, and the interaction must follow the specified data format;
  • for example, the sentence entered by the user is 我想去商场买双鞋 ("I want to go to the mall to buy a pair of shoes"); the request and response take the forms described below.
  • DomainType is the domain type, e.g., 001 for the general domain, 002 for the telecom domain, 003 for the medical domain, and so on; TrainForward is the prediction direction, where true means forward prediction and false means backward prediction; Sentence is the sentence to be processed.
  • Probability is the sentence probability, i.e., the probability that the sentence is a correct sentence; WordNProb is the probability of the Nth character, normalized.
  • step S18, using the natural language sentence library to predict the characters that follow the currently received sentence, may include the following steps:
  • Step S181: determine the number of characters in the currently received sentence, its verification direction, and the number of candidate characters to predict;
  • Step S182: calculate, in the natural language sentence library and along the verification direction, the probability of each character in the sentence;
  • Step S183: calculate the occurrence probability of each candidate character from the per-character probabilities and the number of candidates.
  • the probability of the next character of a sentence can be predicted from a partial sentence, i.e., the character probabilities for the continuation can be calculated;
  • Example 1: 我大学毕业了,我每天要去上(*)。 ("I have graduated from college; every day I go to (*).")
  • the sentence is predicted through the prediction interface of the NLP online interface, which likewise requires interaction in the specified data format;
  • DomainType is the domain type, e.g., 001 for the general domain, 002 for the telecom domain, 003 for the medical domain, and so on; TrainForward is the prediction direction, where true means forward prediction and false means backward prediction; Sentence is the sentence to be processed; ForecastNum is the number of predicted characters, i.e., how many predicted values to display.
  • Forecast1 is the probability of the predicted character, normalized.
  • through the above analysis, a natural language sentence library based on a recurrent neural network is constructed, i.e., a deep-learning RNN model is used to provide data support for NLP. The principle is: first collect the corpus and train it with the RNN model to obtain the natural language sentence library under the RNN model; then use that library for natural language processing, which may include, but is not limited to, spelling correction, input association, sentence judgment, and dialog generation.
  • FIG. 2 is a schematic diagram of a spelling error correction process using the natural language sentence library according to a preferred embodiment of the present invention.
  • in this usage example, a sentence to be corrected is processed character by character.
  • the forward and reverse sentence libraries can be applied separately; the principles of forward and reverse processing are basically the same, and the purpose is to improve correction accuracy.
  • reverse processing is usually more accurate than forward processing; since precision matters more than recall here, bidirectional processing is used to improve precision.
  • the first step processes the ith character and generates "candidate correction set" data according to the prediction interface, covering insertion, deletion, and replacement operations; for example, for 我想去商厂买一双鞋子 ("I want to go to the mall to buy a pair of shoes", where 商厂 is a typo for 商场).
  • the second step performs the above operation on every character in the sentence, yielding the complete "candidate correction set".
  • the fifth step handles homophones separately, because homophone errors are very common.
  • the original character and the N new characters are converted to pinyin; if there is exactly one homophonous new character, the corresponding sentence is the correction, and the homophonous sentence is taken as the correction result.
  • this step is optional.
  • the seventh step: if the "candidate correction set" is empty, the original sentence is normal and needs no correction; if the set contains exactly one entry, that entry is the correction; if the set contains more than two entries, no correction is made.
  • FIG. 3 is a schematic diagram of an input association process using the natural language sentence library according to a preferred embodiment of the present invention.
  • in this usage example, to make input easier for the user, when the user has typed the first half of a sentence the content of the second half is prompted automatically;
  • the first step predicts the next character from the "first-half sentence" and selects the K characters with the highest probability.
  • the second step judges whether the sentence has reached the length limit; if so, proceed to the next step; if not, append the currently predicted character to the "first-half sentence" and return to the first step, e.g., re-predict on 我想买1000元左右的电视 ("I want to buy a TV of about 1,000 yuan").
  • the third step uses the probability interface to calculate the probability of every "association candidate".
  • the fourth step selects the M candidates with the highest probability as the final associations.
  • sentence library usage example 3: sentence judgment.
  • FIG. 4 is a schematic diagram of a sentence judgment process using the natural language sentence library according to a preferred embodiment of the present invention. As shown in FIG. 4, during NLP processing it is sometimes necessary to judge whether an input is a normal sentence, i.e., sentence judgment.
  • the first step generates the forward sentence probability via the probability interface; if it is below threshold A, the input is judged not to be a normal sentence and the flow ends; otherwise, continue.
  • the second step generates the reverse sentence probability via the probability interface; if it is below threshold B, the input is judged not to be a normal sentence and the flow ends; otherwise, continue.
  • the third step computes a weighted sum of the forward and reverse probabilities; if the result is below threshold C, the input is judged not to be a normal sentence and the flow ends; otherwise, if the result is greater than or equal to threshold C, the input is judged to be a normal sentence, and the flow ends.
  • FIG. 5 is a schematic diagram of a dialog generation process using the natural language sentence library according to a preferred embodiment of the present invention.
  • during NLP processing it is often necessary to generate a sentence; for example, in a question answering system, after the user's intent is understood, a reply sentence must be organized according to that intent.
  • the process may include the following steps:
  • the first step determines the material for dialog generation, e.g., "capital, Beijing, population, 20 million".
  • the second step permutes and combines the material; the arrangement result is one of the possible orderings.
  • the third step adds auxiliary words according to the prediction interface to generate candidate sentences.
  • the fourth step calculates the probability of each candidate sentence via the probability interface and filters the sentences against a preset threshold.
  • the fifth step selects a suitable reply according to a preset strategy, e.g., strategy 1: choose the most probable reply to increase answer accuracy; strategy 2: choose randomly among the filtered sentences to make the answers more lifelike.
  • the entire natural language sentence library is built on the RNN training model; the relationships between characters, between characters and sentences, and between characters and context are all obtained through the training flow.
  • to improve extensibility, the user can add data for new domains or extend the corpus of an existing domain;
  • the NLP online interface mainly exposes the services provided by the sentence library; the service it provides is basic textual information, on which the user can run different NLP applications, such as spelling correction, input association, sentence judgment, and dialog generation.
  • the RNN network structure is the theoretical basis of the natural language sentence library; it can make full use of sentence context to construct the library without manual intervention, thereby greatly reducing the workload of manual operations.
  • the method according to the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, though in many cases the former is the better implementation. The technical solution of the embodiments may thus be embodied as a software product stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc), including a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods described in the various embodiments of the present invention.
  • an apparatus for generating a natural language sentence library is also provided, which is used to implement the above embodiments and preferred implementations; what has already been described is not repeated.
  • the term "module" may denote a combination of software and/or hardware that implements a predetermined function.
  • the apparatus described in the following embodiments is preferably implemented in software, but an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
  • FIG. 6 is a block diagram of an apparatus for generating a natural language sentence library according to an embodiment of the present invention.
  • the apparatus may include: an obtaining module 10 configured to acquire character information and character vectors from the training data set; a conversion module 20 configured to convert the character information into a test set to be recognized, using character vectors of a preset dimension; and a generating module 30 configured to generate the natural language sentence library by training the test set to be recognized in the recurrent neural network (RNN) model.
  • the RNN model may include: an input layer, a hidden layer, and an output layer, wherein the input layer is adjacent to the hidden layer, and the hidden layer is adjacent to the output layer.
  • FIG. 7 is a flowchart of a device for generating a natural language sentence library according to a preferred embodiment of the present invention.
  • the obtaining module 10 may include: a statistical unit 100 configured to set each character in the training data set. The frequency of occurrence is counted, wherein the characters include at least one of the following: a character, a number, a symbol; and the obtaining unit 102 is configured to set all characters whose appearance frequency is greater than a preset threshold as word information.
  • the generating module 30 may include: an extracting unit 300 configured to extract the number of hidden layers and the number of neurons in each hidden layer from the RNN model parameters configured for the RNN model, and the training data. Intercepting the length; the first calculating unit 302 is configured to calculate the number of neurons of the input layer according to the length of the training data and the preset dimension; the setting unit 304 is configured to set the nerve of the output layer according to the number of characters included in the word information a number of elements; a generating unit 306, configured to train each character of the test set to be identified according to the number of hidden layers and the number of neurons of each hidden layer, the number of neurons in the input layer, and the number of neurons in the output layer, Generate a natural language sentence library.
  • the foregoing apparatus may further include: a processing module 40 configured to verify, using the natural language sentence library, whether the currently received sentence is abnormal, or to predict, using the library, the characters that follow the currently received sentence.
  • the processing module 40 may include: a determining unit 400 configured to determine the number of characters in the currently received sentence and its verification direction; a second calculating unit 402 configured to calculate, in the natural language sentence library and along the verification direction, the probability of each character in the sentence; and a third calculating unit 404 configured to calculate, from the per-character probabilities, the probability that the sentence is a normal sentence.
  • alternatively, the determining unit 400 is further configured to determine the number of characters in the currently received sentence, its verification direction, and the number of candidate characters to predict; the second calculating unit 402 is further configured to calculate, in the library and along the verification direction, the probability of each character in the sentence; and the third calculating unit 404 is further configured to calculate the occurrence probability of each candidate character from the per-character probabilities and the number of candidates.
  • each of the above modules may be implemented by software or hardware; in the latter case, for example, all of the modules may be located in the same processor, or the modules may be distributed across multiple processors.
  • the modules or steps of the embodiments of the present invention can be implemented by a general-purpose computing device; they can be concentrated on a single computing device or distributed across a network of computing devices. Optionally, they may be implemented by program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; in some cases the steps shown or described may be performed in a different order, or they may be fabricated into individual integrated circuit modules, or multiple modules or steps among them may be fabricated into a single integrated circuit module. Thus, the embodiments of the invention are not limited to any specific combination of hardware and software.
  • an embodiment of the present invention further provides a storage medium that includes a stored program, where the program, when run, performs the method described above.
  • the foregoing storage medium may be configured to store program code for performing the steps of the method;
  • for example, acquiring the character information from the training data set includes: counting the frequency of occurrence of each character in the training data set, and sorting the characters whose frequency exceeds a preset threshold in a preset order to obtain the character information.
  • the storage medium may also be configured to store program code defining the RNN model: an input layer, a hidden layer, and an output layer, where the input layer is adjacent to the hidden layer and the hidden layer is adjacent to the output layer.
  • the storage medium may also be configured to store program code for generating the natural language sentence library by training the test set to be recognized in the RNN model, in which the number of input-layer neurons and the number of output-layer neurons are used to train each character of the test set to be recognized.
  • the storage medium may also be configured to store program code for the steps performed after the library is generated: verifying, using the library, whether the currently received sentence is abnormal, or predicting, using the library, the characters that follow the currently received sentence.
  • the foregoing storage medium may include, but is not limited to, various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
  • an embodiment of the present invention further provides a processor configured to run a program, where the program, when run, performs the steps of any of the methods above, including:
  • acquiring the character information from the training data set, by counting the frequency of occurrence of each character and sorting the characters whose frequency exceeds a preset threshold in a preset order;
  • defining the RNN model as an input layer, a hidden layer, and an output layer, where the input layer is adjacent to the hidden layer and the hidden layer is adjacent to the output layer;
  • generating the natural language sentence library by training the test set to be recognized in the RNN model;
  • and, after the library is generated, verifying, using the library, whether the currently received sentence is abnormal, or predicting, using the library, the characters that follow the currently received sentence.
  • the modules or steps of the embodiments of the present invention may be implemented by a general-purpose computing device; they may be centralized on a single computing device or distributed over a network of computing devices. Optionally, they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device; in some cases the steps shown or described may be performed in a different order, or they may be fabricated into individual integrated circuit modules, or multiple modules or steps among them may be implemented as a single integrated circuit module. Thus, the embodiments of the invention are not limited to any specific combination of hardware and software.
  • by acquiring character information from the training data set, converting it into a test set to be recognized using character vectors of a preset dimension, and training the test set in the RNN model, the method of generating a natural language sentence library achieves a high recognition rate, is easy to use, and can satisfy the NLP business needs of question answering systems, retrieval systems, expert systems, online customer service, mobile assistants, personal assistants, and the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the present invention provide a method and an apparatus for generating a natural language sentence library. Character information is acquired from a training data set; the character information is converted into a test set to be recognized, using character vectors of a preset dimension; and the test set to be recognized is trained in an RNN model to generate the natural language sentence library. This solves the problem that constructing such a library in the related art requires extensive manual intervention and a complicated procedure, and achieves a high recognition rate and ease of use, satisfying the NLP business needs of question answering systems, retrieval systems, expert systems, online customer service, mobile assistants, personal assistants, and the like.

Description

Method and apparatus for generating a natural language sentence library
Technical Field
Embodiments of the present invention relate to the field of computers, and in particular to a method and an apparatus for generating a natural language sentence library.
Background
With the development of computers and network technology, applications of artificial intelligence (AI) are encountered everywhere in daily work and life, and AI is closely related to the processing of text, i.e., natural language processing (NLP). Further, since a natural language sentence library based on characters (rather than words) is the foundation of natural language processing, building such a basic character-level sentence library is essential: it can support many NLP tasks, such as spelling correction, input association, sentence judgment, and dialog generation.
In today's rapid development of big data and artificial intelligence, text plays an irreplaceable role as an important information carrier. Processing textual information accurately can improve the service quality of NLP systems and the user experience; it is an important topic within natural language understanding, and research on it is urgent.
Summary of the Invention
Embodiments of the present invention provide a method and an apparatus for generating a natural language sentence library, so as to at least solve the problem that constructing such a library in the related art requires extensive manual intervention and a complicated procedure.
According to one aspect of the embodiments of the present invention, a method for generating a natural language sentence library is provided, including:
acquiring character information from a training data set; converting the character information into a test set to be recognized, using character vectors of a preset dimension; and training the test set to be recognized in a recurrent neural network (RNN) model to generate the natural language sentence library.
In an embodiment of the present invention, acquiring the character information from the training data set includes: counting the frequency of occurrence of each character in the training data set, where a character includes at least one of the following: a written character, a digit, or a symbol; and sorting the characters whose frequency exceeds a preset threshold in a preset order to obtain the character information.
In an embodiment of the present invention, the RNN model includes an input layer, a hidden layer, and an output layer, where the input layer is adjacent to the hidden layer and the hidden layer is adjacent to the output layer.
In an embodiment of the present invention, generating the natural language sentence library by training the test set to be recognized in the RNN model includes: extracting, from the RNN model parameters configured for the RNN model, the number of hidden layers, the number of neurons in each hidden layer, and the interception length of the training data; calculating the number of input-layer neurons from the interception length and the preset dimension; setting the number of output-layer neurons according to the number of characters in the character information; and training each character in the test set to be recognized according to the number of hidden layers, the number of neurons in each hidden layer, the number of input-layer neurons, and the number of output-layer neurons, to generate the natural language sentence library.
In an embodiment of the present invention, after the natural language sentence library is generated, the method further includes one of the following: using the library to verify whether the currently received sentence is abnormal; or using the library to predict the characters that follow the currently received sentence.
In an embodiment of the present invention, using the library to verify whether the currently received sentence is abnormal includes: determining the number of characters in the currently received sentence and its verification direction; calculating, in the library and along the verification direction, the probability of each character in the sentence; and calculating, from the per-character probabilities, the probability that the sentence is a normal sentence.
In an embodiment of the present invention, using the library to predict the characters that follow the currently received sentence includes: determining the number of characters in the currently received sentence, its verification direction, and the number of candidate characters to predict; calculating, in the library and along the verification direction, the probability of each character in the sentence; and calculating the occurrence probability of each candidate character from the per-character probabilities and the number of candidates.
According to another aspect of the embodiments of the present invention, an apparatus for generating a natural language sentence library is provided, including:
an obtaining module configured to acquire character information and character vectors from the training data set; a conversion module configured to convert the character information into a test set to be recognized, using character vectors of a preset dimension; and a generating module configured to generate the natural language sentence library by training the test set to be recognized in the recurrent neural network (RNN) model.
In an embodiment of the present invention, the obtaining module includes: a statistics unit configured to count the frequency of occurrence of each character in the training data set, where a character includes at least one of the following: a written character, a digit, or a symbol; and a first obtaining unit configured to sort the characters whose frequency exceeds a preset threshold in a preset order to obtain the character information.
In an embodiment of the present invention, the RNN model includes an input layer, a hidden layer, and an output layer, where the input layer is adjacent to the hidden layer and the hidden layer is adjacent to the output layer.
In an embodiment of the present invention, the generating module includes: an extracting unit configured to extract, from the RNN model parameters configured for the RNN model, the number of hidden layers, the number of neurons in each hidden layer, and the interception length of the training data; a first calculating unit configured to calculate the number of input-layer neurons from the interception length and the preset dimension; a setting unit configured to set the number of output-layer neurons according to the number of characters in the character information; and a generating unit configured to train each character in the test set to be recognized according to the number of hidden layers, the number of neurons in each hidden layer, the number of input-layer neurons, and the number of output-layer neurons, to generate the natural language sentence library.
In an embodiment of the present invention, the apparatus further includes a processing module configured to verify, using the natural language sentence library, whether the currently received sentence is abnormal, or to predict, using the library, the characters that follow the currently received sentence.
In an embodiment of the present invention, the processing module includes: a determining unit configured to determine the number of characters in the currently received sentence and its verification direction; a second calculating unit configured to calculate, in the library and along the verification direction, the probability of each character in the sentence; and a third calculating unit configured to calculate, from the per-character probabilities, the probability that the sentence is a normal sentence.
In an embodiment of the present invention, the processing module includes: a determining unit configured to determine the number of characters in the currently received sentence, its verification direction, and the number of candidate characters to predict; a second calculating unit configured to calculate, in the library and along the verification direction, the probability of each character in the sentence; and a third calculating unit configured to calculate the occurrence probability of each candidate character from the per-character probabilities and the number of candidates.
According to yet another embodiment of the present invention, a storage medium is further provided; the storage medium includes a stored program, where the program, when run, performs any of the methods described above.
Through the embodiments of the present invention, character information is acquired from the training data set, converted into a test set to be recognized using character vectors of a preset dimension, and trained in the RNN model to generate the natural language sentence library. This solves the problem that constructing such a library in the related art requires extensive manual intervention and a complicated procedure, and achieves a high recognition rate and ease of use, satisfying the NLP business needs of question answering systems, retrieval systems, expert systems, online customer service, mobile assistants, personal assistants, and the like.
Brief Description of the Drawings
The drawings described here are provided for a further understanding of the present invention and form a part of this application; the exemplary embodiments of the present invention and their description are used to explain the present invention and do not unduly limit it. In the drawings:
FIG. 1 is a flowchart of a method for generating a natural language sentence library according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a spelling error correction process using the natural language sentence library according to a preferred embodiment of the present invention;
FIG. 3 is a schematic diagram of an input association process using the natural language sentence library according to a preferred embodiment of the present invention;
FIG. 4 is a schematic diagram of a sentence judgment process using the natural language sentence library according to a preferred embodiment of the present invention;
FIG. 5 is a schematic diagram of a dialog generation process using the natural language sentence library according to a preferred embodiment of the present invention;
FIG. 6 is a block diagram of an apparatus for generating a natural language sentence library according to an embodiment of the present invention;
FIG. 7 is a block diagram of an apparatus for generating a natural language sentence library according to a preferred embodiment of the present invention.
Detailed Description
The present invention is described in detail below with reference to the drawings and in combination with the embodiments. It should be noted that, where no conflict arises, the embodiments in this application and the features in the embodiments may be combined with one another.
It should be noted that the terms "first", "second", and the like in the specification, claims, and drawings of the embodiments of the present invention are used to distinguish similar objects and are not necessarily used to describe a particular sequence or order.
Embodiment 1
This embodiment provides a method for generating a natural language sentence library. FIG. 1 is a flowchart of the method according to an embodiment of the present invention; as shown in FIG. 1, the flow includes the following steps:
Step S12: acquire character information from the training data set;
Step S14: convert the character information into a test set to be recognized, using character vectors of a preset dimension;
Step S16: train the test set to be recognized in a recurrent neural network (RNN) model to generate the natural language sentence library.
Through the above steps, a method for constructing a natural language sentence library is provided. It is mainly aimed at question answering systems, retrieval systems, expert systems, online customer service, mobile assistants, personal assistants, and the like, and processes Chinese natural language; it is especially suitable for the fields of natural language processing (NLP), artificial intelligence (AI), intelligent question answering, and text mining. The natural language sentence library provided by the embodiments of the present invention is implemented on a deep-learning RNN model. It makes full use of the large amount of information contained in context, summarizing the relationships between words, between words and sentences, between words and characters, between characters, and between words and their context, and creates basic natural-language knowledge data for supporting a variety of NLP tasks. That is, character information and character vectors are acquired from the training data set, and the training data set, character information, character vectors, and the pre-configured RNN model parameters are trained in the RNN model to generate the natural language sentence library. This solves the problem that constructing such a library in the related art requires extensive manual intervention and a complicated procedure, and achieves an approach with little manual intervention that is simple, fast, easy to implement, and highly accurate.
In a preferred implementation, the RNN model may include: an input layer, a hidden layer, and an output layer, where the input layer is adjacent to the hidden layer and the hidden layer is adjacent to the output layer.
The pre-configured RNN model parameters may include, but are not limited to:
(1) the number of input-layer neurons I, e.g., I = 50;
(2) the number of hidden layers and the number of neurons in each hidden layer; e.g., with three hidden layers H1, H2, H3 (H1 denoting the first hidden layer, H2 the second, and H3 the third), the hidden-layer parameters can be set from past experience, e.g., H1 = 200, H2 = 800, H3 = 3200, ...;
(3) the number of output-layer neurons K, which also represents the number of characters in the character information, e.g., K = 5000;
(4) the interception length W of the training data.
The training data set may be a large body of text; suppose N (e.g., N = 200,000) news articles published by a portal site (e.g., Tencent) are selected; the more, the better.
In a specific implementation, the selected news articles may be split into paragraphs, one paragraph per line, or left unsplit with one article per line, and the result is stored in the text file data.txt, with sample lines such as:
但是自建自住的如果出租,甚至于也有转让的可能性,不管采取什么形式都有转让。
同时亦要求台当局行政院在重大政策出门前,必须寻求党与政的最大共识。
他总是在我身边嚷嚷,我怎么睡得着呢,娄淑元醒来后精神很好。
在油价波动方向不确定、且波动幅度较大的情况下,适当收取燃油附加费是应对油价波动的有效措施。
超市老板娘潘某也发现钙奶里有一股煤气的味道,于是给小诺换了一瓶。
张颖不但自称中央电视台记者,还恐吓他要曝光,并踢了他几脚。
Optionally, in step S12, acquiring the character information from the training data set may include the following steps:
Step S121: count the frequency of occurrence of each character in the training data set, where a character includes at least one of the following: a written character, a digit, or a symbol;
Step S122: sort the characters whose frequency exceeds a preset threshold in a preset order to obtain the character information.
Collecting character statistics (note: character, not word, statistics) prepares for the subsequent RNN training and for the use of the RNN sentence library. Concretely, statistics over the training data set yield a little over 5000 common characters, all of them common simplified characters, digits, symbols, and the like.
For convenience, in the following description the number of characters is set to K = 5000.
Suppose the characters are numbered as follows:
......
114:上
115:下
116:不
117:与
118:且
119:丑
120:专
......
It should be noted that the leading number is not a count of characters but each character's serial number; the numbers are assigned in ascending order starting from 0, and once determined, the number corresponding to each character no longer changes during training and testing.
Since the set of common Chinese characters is essentially fixed and new characters are rarely created, once this "character information" is determined it can be used not only for training on medical-domain information but also for training on text from any other domain, such as telecommunications or machinery.
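To make steps S121 and S122 concrete, the following is a minimal sketch (not part of the patent); the file name data.txt comes from the embodiment above, while the threshold value and function name are illustrative assumptions.

```python
from collections import Counter

def build_char_table(corpus_path, min_freq=5):
    # Count every character (hanzi, digits, symbols) in the training data.
    counts = Counter()
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.strip())
    # Keep characters whose frequency exceeds the preset threshold,
    # sorted by descending frequency (one possible "preset order").
    kept = [ch for ch, n in counts.most_common() if n > min_freq]
    # Serial numbers start at 0 and, once assigned, never change
    # during training and testing.
    return {ch: i for i, ch in enumerate(kept)}

char2id = build_char_table("data.txt")   # roughly 5000 entries on a large corpus
```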
Character vectors are prepared next; their principle is similar to that of word vectors, and any form in common use in this technical field may be adopted.
The vector length trained here (i.e., the preset dimension) is S; suppose S = 50, i.e., 50-dimensional character vectors.
The character vectors take the following form:
上 [0.1525, 0.3658, 0.1634, 0.2510, ...]
下 [0.1825, 0.2658, 0.1834, 0.2715, ...]
In the preferred embodiment of the present invention, character vectors are chosen instead of word vectors because the number of characters is small and relatively stable, whereas the number of words is large and new words keep emerging. In particular, when the RNN sentence library is used for spelling correction, the sentence itself may contain typos, which would hurt word-segmentation accuracy; the preferred embodiment therefore uses character vectors rather than word vectors.
Likewise, since the set of common Chinese characters is essentially fixed and new characters are rarely created, once these "character vectors" are determined they can be used not only for training on medical-domain information but also for training on text from any other domain, such as telecommunications or machinery.
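The patent treats the 50-dimensional character vectors as given and does not fix a training method; a minimal stand-in for the lookup structure it implies might look like this (random vectors replace pre-trained ones, and all names are illustrative):

```python
import numpy as np

S = 50                               # preset dimension of each character vector
chars = list("上下不与且丑专")         # stand-in for the ~5000-character table
rng = np.random.default_rng(0)

# Pre-trained vectors would be loaded here; random values are a stand-in.
char_vectors = {ch: rng.normal(scale=0.1, size=S) for ch in chars}

def sentence_to_input(sentence):
    # One S-dimensional vector per character (X = 1, so I = S = 50).
    return np.stack([char_vectors[ch] for ch in sentence if ch in char_vectors])

print(sentence_to_input("上下").shape)  # (2, 50)
```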
Optionally, in step S16, generating the natural language sentence library by training the test set to be recognized in the RNN model may include the following steps:
Step S161: extract, from the RNN model parameters configured for the RNN model, the number of hidden layers, the number of neurons in each hidden layer, and the interception length of the training data;
Step S162: calculate the number of input-layer neurons from the interception length of the training data and the preset dimension;
Step S163: set the number of output-layer neurons according to the number of characters in the character information;
Step S164: train each character in the test set to be recognized according to the number of hidden layers, the number of neurons in each hidden layer, the number of input-layer neurons, and the number of output-layer neurons, to generate the natural language sentence library.
The training data set and the pre-configured RNN model parameters are fed into the RNN model and trained repeatedly until the parameter change is smaller than X (configurable, e.g., X = 0.001). RNN training then yields N weight matrices, i.e., the sentence library under the RNN model; the RNN weight matrices are saved to a binary file weight.rnn for subsequent natural language processing. To improve the accuracy of the library, forward training and reverse training are performed separately, yielding two sets of weight coefficients.
Concretely, the embodiment of the present invention is a sentence library implemented on an RNN recurrent neural network, i.e., a sentence library based on deep learning (DL).
According to the specific structure of the RNN recurrent neural network, the model can be determined to include the following parts:
(1) Input layer
Suppose X characters are input at a time; then I = X × (dimension of the character vector):
with X = 1 and 50-dimensional character vectors, I = 50;
with X = 2 and 50-dimensional character vectors, I = 100.
In the preferred embodiment, single characters are used for training, i.e., X = 1 and I = 50.
It should be noted that the larger X is, the more accurate the trained RNN model, but the greater the training workload.
(2) Hidden layer
The number of hidden layers and the number of neurons in each hidden layer must be determined here.
For an RNN model with three hidden layers, one could set H1 = 200, H2 = 800, H3 = 2000; H can be set from past experience. In the preferred implementation, one hidden layer with H = 200 neurons is used.
It should also be noted that the more hidden layers there are, the more accurate the trained model, but the greater the training workload.
(3) Output layer
The number of output-layer neurons K can be kept consistent with the number of characters in the character information;
in the preferred implementation, K = 5000.
The purpose of the training process is to obtain N pairs of weight matrices, where the value of N is related to the number of hidden layers of the RNN:
with 1 hidden layer, N = 3 weight matrices are obtained;
with 2 hidden layers, N = 5 weight matrices are obtained;
with 3 hidden layers, N = 7 weight matrices are obtained.
Suppose the current configuration is one hidden layer with 200 neurons, one input layer with 50 neurons, and one output layer with 5000 neurons; denoting the input layer by i, the hidden layer by h, and the output layer by k, three weight matrices are obtained:
(1) Wih, the input-to-hidden weight matrix;
(2) Whk, the hidden-to-output weight matrix;
(3) Whh, the hidden-to-hidden weight matrix.
Since bidirectional training is used, the number of weight matrices finally obtained is 2N.
In addition, the character information is used when the RNN computes the error, and the character vectors are used when the characters of the training data set are converted into numeric information that the computer can recognize.
From the above, the hidden-layer neurons are computed as
$a_h^t = \sum_{i=1}^{I} w_{ih}\,x_i^t + \sum_{h'=1}^{H} w_{h'h}\,b_{h'}^{t-1}, \qquad b_h^t = \theta_h(a_h^t)$
where θh is the activation function, I is the number of input nodes, and H is the number of hidden-layer neurons.
The output-layer neurons are computed as
$a_k^t = \sum_{h=1}^{H} w_{hk}\,b_h^t, \qquad y_k^t = \theta_k(a_k^t) = \frac{e^{a_k^t}}{\sum_{k'=1}^{K} e^{a_{k'}^t}}$
where θk is the softmax function, H is the number of hidden-layer neurons, and K is the number of output-layer neurons.
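As a cross-check of these formulas, here is a minimal numpy sketch of the forward pass they describe, using the dimensions of the preferred embodiment (I = 50, H = 200, K = 5000); tanh is assumed for θh (the patent does not name the activation), and the random weights are stand-ins for trained ones:

```python
import numpy as np

I, H, K = 50, 200, 5000
rng = np.random.default_rng(0)
W_ih = rng.normal(scale=0.01, size=(I, H))  # input -> hidden
W_hh = rng.normal(scale=0.01, size=(H, H))  # hidden -> hidden (recurrent)
W_hk = rng.normal(scale=0.01, size=(H, K))  # hidden -> output

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def forward(xs):
    # xs: sequence of I-dimensional character vectors.
    b = np.zeros(H)                    # b^{t-1}: previous hidden activations
    ys = []
    for x in xs:
        a_h = x @ W_ih + b @ W_hh      # a_h^t = sum_i w_ih x_i^t + sum_h' w_h'h b_h'^{t-1}
        b = np.tanh(a_h)               # b_h^t = theta_h(a_h^t)
        ys.append(softmax(b @ W_hk))   # y^t: distribution over the K characters
    return ys

ys = forward([rng.normal(size=I) for _ in range(3)])
print(ys[-1].shape)  # (5000,)
```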
The aim of repeated training (say, 2000 passes) is to obtain the N pairs of weight matrices, i.e., the W weight file.
The format of the weight file is: P-Q,X-Y,W;
where P is the serial number of the upper network layer, Q is the serial number of the lower network layer, X is the serial number of the upper-layer neuron, Y is the serial number of the lower-layer neuron, and W is the weight of the connection between the two neurons in the RNN model.
A corresponding example is as follows:
0-1,1-1,0.3415
0-1,1-2,0.5415
0-1,1-3,0.6415
1-2,1-1,0.4715
1-2,1-2,0.5415
1-2,1-3,0.6415
2-2,1-1,0.7415
2-2,1-2,0.8415
2-2,1-3,0.9015
……
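The binary layout of weight.rnn is not specified; the sketch below parses only the textual P-Q,X-Y,W form shown in the example above, and the function name is an assumption:

```python
def load_weights(path):
    # Each line has the form "P-Q,X-Y,W", e.g. "0-1,1-3,0.6415".
    weights = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            pq, xy, w = line.split(",")
            p, q = (int(n) for n in pq.split("-"))
            x, y = (int(n) for n in xy.split("-"))
            weights[(p, q, x, y)] = float(w)  # (layer pair, neuron pair) -> weight
    return weights
```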
Optionally, after step S16 generates the natural language sentence library by training the test set to be recognized in the RNN model, one of the following steps may also be performed:
Step S17: use the natural language sentence library to verify whether the currently received sentence is abnormal;
Step S18: use the natural language sentence library to predict the characters that follow the currently received sentence.
The natural language sentence library is used through an "NLP online interface", through which natural language processing functions such as spelling correction, input association, sentence judgment, and dialog generation can be realized. The NLP online interface includes a probability interface and a prediction interface; their request and response message formats are as follows:
(1) The request format of the probability interface is shown in Table 1.
Table 1
| Name | Description | Example |
| DomainType | domain | general domain 001; medical domain 002; telecom domain 003; etc. |
| TrainForward | direction | true = forward prediction, false = backward prediction |
| Sentence | sentence | the sentence to process: 你叫什么名子? |
(2) The response format of the probability interface is shown in Table 2.
Table 2
| Name | Description | Example |
| Probability | sentence probability | probability of being a well-formed sentence, e.g., 0.4501 |
| Word1Prob | probability of character 1 | probability of the 1st character, e.g., 0.2536 |
| Word2Prob | probability of character 2 | probability of the 2nd character, e.g., 0.3536 |
| Word3Prob | probability of character 3 | probability of the 3rd character, e.g., 0.2736 |
| WordNProb | probability of character N | probability of the Nth character, e.g., 0.5636 |
(3) The request format of the prediction interface is shown in Table 3.
Table 3
| Name | Description | Example |
| DomainType | domain | general domain 001; medical domain 002; telecom domain 003; etc. |
| TrainForward | direction | true = forward prediction, false = backward prediction |
| Sentence | sentence | the sentence to process: 你叫什么名字? |
| ForecastNum | number of predicted characters | the number N of characters to display when predicting the next character |
(4) The response format of the prediction interface is shown in Table 4.
Table 4
| Name | Description | Example |
| Forecast1 | 1st character | the next character of the sentence and its probability, e.g., 车 0.2523 |
| Forecast2 | 2nd character | the next character of the sentence and its probability, e.g., 人 0.2323 |
| Forecast3 | 3rd character | the next character of the sentence and its probability, e.g., 电 0.2023 |
| ForecastN | Nth character | the next character of the sentence and its probability, e.g., 学 0.1923 |
Optionally, in step S17, using the natural language sentence library to verify whether the currently received sentence is abnormal may include the following steps:
Step S171: determine the number of characters in the currently received sentence and its verification direction;
Step S172: calculate, in the natural language sentence library and along the verification direction, the probability of each character in the sentence;
Step S173: calculate, from the per-character probabilities, the probability that the sentence is a normal sentence.
After the natural language sentence library is generated, it can be used to calculate the probability of each character in a sentence entered by the user and the average probability of the whole sentence.
For example: 我爱我们的祖国。 ("I love our motherland.")
The per-character probabilities are:
<我 0.000> <爱 0.0624> <我 0.2563> <们 0.2652> <的 0.2514> <祖 0.2145> <国 0.2145>
The average probability of the whole sentence is 0.2850.
In the preferred implementation, the sentence probability is calculated through the probability interface of the NLP online interface, and the interaction must follow the specified data format.
For example, the sentence entered by the user is 我想去商场买双鞋 ("I want to go to the mall to buy a pair of shoes"); the request and response take the following forms.
The request sent to the probability interface is formatted as follows:
(the request example appears as an image in the original document)
where DomainType is the domain type, e.g., 001 for the general domain, 002 for the telecom domain, 003 for the medical domain, and so on; TrainForward is the prediction direction, with true for forward prediction and false for backward prediction; and Sentence is the sentence to be processed.
The response returned by the probability interface is formatted as follows:
(the response example appears as an image in the original document)
where Probability is the sentence probability, i.e., the probability that this sentence is a correct sentence, and Word1Prob is the probability of a character, normalized.
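Because the request and response examples above are images in the original, the following Python dicts reconstruct their likely shape purely from Tables 1 and 2; everything beyond the tabulated field names and example values is an assumption:

```python
probability_request = {
    "DomainType": "001",     # 001 general, 002 telecom, 003 medical, ... (per the text)
    "TrainForward": True,    # True = forward prediction, False = backward
    "Sentence": "我想去商场买双鞋",
}

# Shape of the reply per Table 2 (values are the table's own examples):
probability_response = {
    "Probability": 0.4501,   # probability that the input is a correct sentence
    "Word1Prob": 0.2536,     # normalized probability of the 1st character
    "Word2Prob": 0.3536,
    "Word3Prob": 0.2736,
    # ... one WordNProb field per character of the sentence
}
```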
Optionally, in step S18, using the natural language sentence library to predict the characters that follow the currently received sentence may include the following steps:
Step S181: determine the number of characters in the currently received sentence, its verification direction, and the number of candidate characters to predict;
Step S182: calculate, in the natural language sentence library and along the verification direction, the probability of each character in the sentence;
Step S183: calculate the occurrence probability of each candidate character from the per-character probabilities and the number of candidates.
After the natural language sentence library is generated, the probability of the next character of a sentence can be predicted from a partial sentence, i.e., the character probabilities for the continuation can be calculated.
For example: 我爱我们的祖(*)。
<国 0.6012>
<先 0.2017>
<宗 0.0254>
......
Since the sentence library makes full use of sentence context, its probability statistics vary with the specific context, as in the following examples:
Example 1: 我大学毕业了,我每天要去上(*)。 ("I have graduated from college; every day I go to (*).")
<班 0.2412>
<学 0.1017>
Example 2: 我今年八岁了,我每天要去上(*)。 ("I am eight years old; every day I go to (*).")
<班 0.1016>
<学 0.1517>
It should be noted here that training can proceed in two directions: forward, i.e., left to right, and backward, i.e., right to left. This bidirectional training yields two sets of weight matrices, and its purpose is to improve the accuracy of NLP processing.
In the preferred implementation, sentences are predicted through the prediction interface of the NLP online interface, which likewise requires interaction in the specified data format.
For example, for the user-entered sentence 社会主义国, the request and response take the following forms.
The request sent to the prediction interface is formatted as follows:
(the request example appears as an image in the original document)
where DomainType is the domain type, e.g., 001 for the general domain, 002 for the telecom domain, 003 for the medical domain, and so on; TrainForward is the prediction direction, with true for forward prediction and false for backward prediction; Sentence is the sentence to be processed; and ForecastNum is the number of predicted characters, i.e., how many predicted values to display.
The response returned by the prediction interface is formatted as follows:
(the response example appears as an image in the original document)
where Forecast1 is the probability of a predicted character, normalized.
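As with the probability interface, the prediction-interface examples are images; the dicts below are a reconstruction from Tables 3 and 4 only, and any detail beyond those tables is an assumption:

```python
prediction_request = {
    "DomainType": "001",
    "TrainForward": True,
    "Sentence": "社会主义国",
    "ForecastNum": 3,        # how many predicted characters to return
}

# Shape of the reply per Table 4 (characters/probabilities are the table's examples):
prediction_response = {
    "Forecast1": "车 0.2523",
    "Forecast2": "人 0.2323",
    "Forecast3": "电 0.2023",
}
```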
Through the above analysis, a natural language sentence library based on a recurrent neural network is constructed, i.e., a deep-learning RNN model provides data support for NLP. The principle is: first collect the corpus and train it with the RNN model to obtain the natural language sentence library under the RNN model; then use that library for natural language processing, which may include, but is not limited to, spelling correction, input association, sentence judgment, and dialog generation.
The preferred implementation described above is further illustrated below with the following usage examples.
Sentence library usage example 1: spelling correction
FIG. 2 is a schematic diagram of a spelling error correction process using the natural language sentence library according to a preferred embodiment of the present invention. As shown in FIG. 2, in this usage example the sentence to be corrected is processed character by character. During processing, the forward and reverse sentence libraries can be applied separately; the principles of forward and reverse processing are basically the same, and the purpose is to improve correction accuracy.
In natural language processing, reverse processing is usually more accurate than forward processing; since precision matters more than recall here, bidirectional processing is used to improve precision.
For example: 我想去商厂买一双鞋子 ("I want to go to the mall to buy a pair of shoes", where 商厂 is a typo for 商场).
When the ith character (e.g., i = 5, the character 厂) is processed, three cases are considered: replacement, insertion, and deletion; new characters with high probability are selected and added to the "candidate correction set". The processing may include the following steps:
Step 1: process the ith character and generate "candidate correction set" data according to the prediction interface, covering insertion, deletion, and replacement; for example, for 我想去商<>买一双鞋子:
我想去商<店>买一双鞋子
我想去商<场>买一双鞋子
我想去商厂<门>买一双鞋子
......
Step 2: perform step 1 on every character in the sentence to obtain the complete "candidate correction set".
Step 3: calculate the bidirectional average probability of each sentence and select the N (e.g., N = 30) new sentences with the largest probability.
Here both the forward and the reverse average probability of a sentence must be relatively large for it to be selected.
Step 4: suppose the sentence with the largest average probability has probability P1 and the second largest has probability P2; if P1 > P2 + X, where X is configurable and can be set to X = 0.2 from past experience, then the P1 sentence is the correction.
Put differently: the most probable sentence (P1) must far exceed the second most probable sentence (P2).
Step 5: research shows that homophone errors are very common, so homophones are handled separately in this step.
Concretely, the original character and the N new characters are all converted to pinyin; if there is exactly one homophonous new character, the corresponding sentence is the correction.
For example, original sentence: 我想去商厂买一双鞋子
new sentences: 我想去商场买一双鞋子——A
我想去商店买一双鞋子——B
The homophonous sentence A is normally taken as the correction result.
It should be noted that this step is optional.
Step 6: characters with similar shapes, frequently confused characters, and the like can also be checked; the main purpose is to assist filtering.
Step 7: if the "candidate correction set" is empty, the original sentence is normal and needs no correction; if the set contains exactly one entry, that entry is the correction; if the set contains more than two entries, no correction is made.
It should be noted that for spelling correction, precision is more important than recall, so when it cannot be determined which candidate is the correction, correction can simply be abandoned.
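The seven steps above can be compressed into the following sketch; predict() and score() stand for hypothetical clients of the prediction and probability interfaces, the optional homophone and similar-shape steps are omitted, and N = 30 and X = 0.2 follow the values suggested above:

```python
def correct(sentence, predict, score, n_keep=30, margin=0.2):
    # predict(prefix, k): top-k characters likely to follow `prefix`
    # score(s): bidirectional average sentence probability of s
    candidates = set()
    for i in range(len(sentence)):
        for new in predict(sentence[:i], 5):
            candidates.add(sentence[:i] + new + sentence[i + 1:])  # replace char i
            candidates.add(sentence[:i] + new + sentence[i:])      # insert before i
        candidates.add(sentence[:i] + sentence[i + 1:])            # delete char i
    best = sorted(candidates, key=score, reverse=True)[:n_keep]    # step 3
    if not best:
        return sentence               # step 7: empty set, original is normal
    if len(best) == 1:
        return best[0]                # step 7: single candidate is the correction
    p1, p2 = score(best[0]), score(best[1])
    if p1 > p2 + margin:              # step 4: clear winner
        return best[0]
    return sentence                   # ambiguous: abandon correction
```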
Sentence library usage example 2: input association
FIG. 3 is a schematic diagram of an input association process using the natural language sentence library according to a preferred embodiment of the present invention. As shown in FIG. 3, in this usage example, to make input easier for the user, when the user has typed the first half of a sentence the content of the second half is prompted automatically.
For example, when the user has typed 我想买1000元左右的电 ("I want to buy a 1000-yuan-or-so elec..."), the completions 视机 ("TV set") or 脑 ("computer") are prompted directly. This function can be implemented with the NLP online interface and may include the following steps:
Step 1: predict the next character from the "first-half sentence" and select the K characters with the highest probability.
For example, predict the character following 我想买1000元左右的电 and its probability:
<视 0.3>
<脑 0.28>
<动 0.1>
Step 2: judge whether the sentence has reached the length limit; if so, continue to the next step; if not, append the currently predicted character to the "first-half sentence" and return to step 1, e.g., re-predict on 我想买1000元左右的电视.
Step 3: use the probability interface to calculate the probability of every "association candidate".
For example:
我想买1000元左右的电<脑> 0.6502
我想买1000元左右的电<动车> 0.6421
我想买1000元左右的电<视机> 0.5241
......
Step 4: select the M candidates with the largest probability as the final associations.
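A sketch of steps 1 through 4 under the same assumptions (hypothetical predict() and score() clients); note that the candidate set grows by a factor of K per added character, so K and the length limit must stay small:

```python
def associate(prefix, predict, score, k=3, max_len=12, m=2):
    # Steps 1-2: grow completions character by character until the length limit.
    frontier, finished = [prefix], []
    while frontier:
        s = frontier.pop()
        if len(s) >= max_len:
            finished.append(s)
            continue
        for ch in predict(s, k):      # the K most probable next characters
            frontier.append(s + ch)
    # Steps 3-4: score every finished candidate and keep the M best.
    finished.sort(key=score, reverse=True)
    return [s[len(prefix):] for s in finished[:m]]
```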
Sentence library usage example 3: sentence judgment
FIG. 4 is a schematic diagram of a sentence judgment process using the natural language sentence library according to a preferred embodiment of the present invention. As shown in FIG. 4, during NLP processing it is sometimes necessary to judge whether an input is a normal sentence, i.e., sentence judgment.
For example, in an intelligent question answering system, some users enter arbitrary non-sentences to probe the system's analytic ability; recognizing such inputs requires the "sentence judgment" function.
Suppose the user enters 来啊爱的到量开太噢同 (a meaningless character string). If the system can judge whether this is a normal sentence, it can reply with something like "please speak intelligibly" or "your phrasing is quite profound". The judgment may include the following steps:
Step 1: generate the forward probability of the sentence via the probability interface; if it is below threshold A, judge that the input is not a normal sentence and end the flow; otherwise continue.
Step 2: generate the reverse probability of the sentence via the probability interface; if it is below threshold B, judge that the input is not a normal sentence and end the flow; otherwise continue.
Step 3: compute a weighted sum of the forward and reverse probabilities; if the result is below threshold C, judge that the input is not a normal sentence and end the flow; otherwise, if the result is greater than or equal to threshold C, judge that the input is a normal sentence, and the flow ends.
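A sketch of the three-step judgment, where prob() stands for a hypothetical probability-interface client and the thresholds A, B, C and the weight are illustrative assumptions (the patent does not give their values):

```python
def is_normal_sentence(sentence, prob, a=0.05, b=0.05, c=0.10, weight=0.5):
    # prob(s, forward): sentence probability from the probability interface.
    p_fwd = prob(sentence, forward=True)
    if p_fwd < a:                     # step 1: forward probability below threshold A
        return False
    p_bwd = prob(sentence, forward=False)
    if p_bwd < b:                     # step 2: reverse probability below threshold B
        return False
    # Step 3: weighted sum of both directions against threshold C.
    return weight * p_fwd + (1 - weight) * p_bwd >= c
```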
Sentence library usage example 4: dialog generation
FIG. 5 is a schematic diagram of a dialog generation process using the natural language sentence library according to a preferred embodiment of the present invention. As shown in FIG. 5, during NLP processing it is often necessary to generate a sentence; for example, in a question answering system, after the user's intent is understood, a reply sentence must be organized according to that intent.
For example, input: 首都北京人口 ("capital Beijing population")
understanding: a query returns 首都北京人口2000万 ("capital Beijing population 20 million")
generated reply: 首都北京的人口是2000万左右 ("the population of the capital Beijing is about 20 million")
The process may include the following steps:
Step 1: determine the material for dialog generation.
For example, the material is 首都、北京、人口、2000万 ("capital, Beijing, population, 20 million").
Step 2: permute and combine the material; the arrangement result is one of the following:
首都、北京、人口、2000万
北京、首都、人口、2000万
人口、首都、北京、2000万
首都、人口、北京、2000万
......
Step 3: add auxiliary words according to the prediction interface to generate candidate sentences.
For example:
首都、的、北京、人口、有、2000万,左右
北京、首都、的、人口、2000万
人口、首都、是、北京、2000万
首都、人口、有、北京、2000万,多
......
Step 4: calculate the probability of each candidate sentence via the probability interface; the sentences can be filtered against a preset threshold.
Step 5: select a suitable reply according to a preset strategy.
For example:
Strategy 1: choose the most probable reply, to increase answer accuracy;
Strategy 2: choose randomly among the filtered sentences, to make the question answering more lifelike.
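A sketch of steps 1 through 5, where add_aux() and score() stand for hypothetical clients of the prediction and probability interfaces and the threshold is an assumed value:

```python
import random
from itertools import permutations

def generate_reply(materials, add_aux, score, threshold=0.3, lifelike=False):
    # materials: e.g. ["首都", "北京", "人口", "2000万"]  (step 1)
    candidates = []
    for order in permutations(materials):   # step 2: permute the materials
        candidates.extend(add_aux(order))   # step 3: insert auxiliary words
    kept = [s for s in candidates if score(s) >= threshold]  # step 4: filter
    if not kept:
        return None
    if lifelike:
        return random.choice(kept)          # strategy 2: randomize replies
    return max(kept, key=score)             # strategy 1: most probable reply
```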
Through the above preferred embodiments, the entire natural language sentence library is built on the RNN training model; the relationships between characters, between characters and sentences, and between characters and context are all obtained through this training flow. To improve the extensibility of the library, the user can add data for new domains or extend the corpus of an existing domain. The "NLP online interface" mainly exposes the services provided by the library; the service it provides is basic textual information, on which the user can run different NLP applications, such as spelling correction, input association, sentence judgment, and dialog generation. In addition, the RNN network structure is the theoretical basis of the library; it can make full use of sentence context to construct the library without manual intervention, greatly reducing the workload of manual operations.
From the description of the above embodiments, those skilled in the art can clearly understand that the methods according to the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, though in many cases the former is the better implementation. Based on this understanding, the technical solution of the embodiments of the present invention, in essence or in the part contributing over the prior art, can be embodied as a software product stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc), including a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods described in the various embodiments of the present invention.
Embodiment 2
This embodiment also provides an apparatus for generating a natural language sentence library; the apparatus is used to implement the above embodiments and preferred implementations, and what has already been described is not repeated. As used below, the term "module" may denote a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
FIG. 6 is a block diagram of an apparatus for generating a natural language sentence library according to an embodiment of the present invention. As shown in FIG. 6, the apparatus may include: an obtaining module 10 configured to acquire character information and character vectors from the training data set; a conversion module 20 configured to convert the character information into a test set to be recognized, using character vectors of a preset dimension; and a generating module 30 configured to generate the natural language sentence library by training the test set to be recognized in the recurrent neural network (RNN) model.
In a preferred implementation, the RNN model may include: an input layer, a hidden layer, and an output layer, where the input layer is adjacent to the hidden layer and the hidden layer is adjacent to the output layer.
Optionally, FIG. 7 is a block diagram of an apparatus for generating a natural language sentence library according to a preferred embodiment of the present invention. As shown in FIG. 7, the obtaining module 10 may include: a statistics unit 100 configured to count the frequency of occurrence of each character in the training data set, where a character includes at least one of the following: a written character, a digit, or a symbol; and an obtaining unit 102 configured to take all characters whose frequency exceeds a preset threshold as the character information.
Optionally, as shown in FIG. 7, the generating module 30 may include: an extracting unit 300 configured to extract, from the RNN model parameters configured for the RNN model, the number of hidden layers, the number of neurons in each hidden layer, and the interception length of the training data; a first calculating unit 302 configured to calculate the number of input-layer neurons from the interception length and the preset dimension; a setting unit 304 configured to set the number of output-layer neurons according to the number of characters in the character information; and a generating unit 306 configured to train each character of the test set to be recognized according to the number of hidden layers, the number of neurons in each hidden layer, the number of input-layer neurons, and the number of output-layer neurons, to generate the natural language sentence library.
Optionally, as shown in FIG. 7, the apparatus may further include: a processing module 40 configured to verify, using the natural language sentence library, whether the currently received sentence is abnormal, or to predict, using the library, the characters that follow the currently received sentence.
Optionally, as shown in FIG. 7, the processing module 40 may include: a determining unit 400 configured to determine the number of characters in the currently received sentence and its verification direction; a second calculating unit 402 configured to calculate, in the library and along the verification direction, the probability of each character in the sentence; and a third calculating unit 404 configured to calculate, from the per-character probabilities, the probability that the sentence is a normal sentence.
Optionally, the determining unit 400 is further configured to determine the number of characters in the currently received sentence, its verification direction, and the number of candidate characters to predict; the second calculating unit 402 is further configured to calculate, in the library and along the verification direction, the probability of each character in the sentence; and the third calculating unit 404 is further configured to calculate the occurrence probability of each candidate character from the per-character probabilities and the number of candidates.
It should be noted that the above modules can be implemented by software or by hardware; in the latter case this can be achieved, without limitation, in the following manner: the above modules are all located in the same processor, or the above modules are located in multiple processors respectively.
Obviously, those skilled in the art should understand that the modules or steps of the above embodiments of the present invention can be implemented by a general-purpose computing device; they can be concentrated on a single computing device or distributed over a network formed by multiple computing devices; optionally, they can be implemented by program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; in some cases, the steps shown or described can be performed in an order different from the one given here; alternatively, they can each be made into an individual integrated circuit module, or multiple of the modules or steps can be made into a single integrated circuit module. Thus, the embodiments of the present invention are not limited to any specific combination of hardware and software.
The above are only preferred embodiments of the present invention and are not intended to limit it; for those skilled in the art, the present invention may have various modifications and changes. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Embodiment 3
An embodiment of the present invention further provides a storage medium including a stored program, where the program, when run, performs any one of the methods described above.
Optionally, in this embodiment, the storage medium may be configured to store program code for performing the following steps:
S1, acquiring character information from a training data set;
S2, converting the character information into a test set to be recognized using character vectors of a preset dimension;
S3, generating a natural language sentence library by training the test set to be recognized in a recurrent neural network (RNN) model.
Optionally, the storage medium is further configured to store program code for performing the following steps, where acquiring the character information from the training data set includes:
S1, counting the occurrence frequency of each character in the training data set, where the characters include at least one of the following: words, digits, symbols;
S2, sorting the characters whose occurrence frequency is greater than a preset threshold in a preset order to obtain the character information.
Optionally, in this embodiment, the storage medium may be configured to store program code for performing the following step:
S1, the RNN model includes an input layer, a hidden layer and an output layer, where the input layer is adjacent to the hidden layer, and the hidden layer is adjacent to the output layer.
Optionally, in this embodiment, the storage medium may be configured to store program code for performing the following steps, where generating the natural language sentence library by training the test set to be recognized in the RNN model includes:
S1, extracting, from the RNN model parameters configured for the RNN model, the number of hidden layers, the number of neurons in each hidden layer and the training data truncation length;
S2, computing the number of neurons in the input layer from the training data truncation length and the preset dimension;
S3, setting the number of neurons in the output layer according to the number of characters contained in the character information;
S4, training each character in the test set to be recognized according to the number of hidden layers, the number of neurons in each hidden layer, the number of neurons in the input layer and the number of neurons in the output layer, so as to generate the natural language sentence library.
Optionally, in this embodiment, the storage medium may be configured to store program code for performing one of the following steps after the natural language sentence library is generated by training the test set to be recognized in the RNN model:
S1, verifying, using the natural language sentence library, whether a currently received sentence is an abnormal sentence;
S2, predicting, using the natural language sentence library, the characters following the currently received sentence.
Optionally, in this embodiment, the storage medium may be configured to store program code for performing the following steps, where verifying, using the natural language sentence library, whether the currently received sentence is the abnormal sentence includes:
S1, determining the number of characters contained in the currently received sentence and the verification direction of the currently received sentence;
S2, computing, in the natural language sentence library and according to the verification direction, the probability of each character contained in the currently received sentence;
S3, computing, from the probability of each character, the probability that the currently received sentence is a normal sentence.
Optionally, in this embodiment, the storage medium may be configured to store program code for performing the following steps, where predicting, using the natural language sentence library, the characters following the currently received sentence includes:
S1, determining the number of characters contained in the currently received sentence, the verification direction of the currently received sentence and the number of candidate characters to be predicted;
S2, computing, in the natural language sentence library and according to the verification direction, the probability of each character contained in the currently received sentence;
S3, computing, from the probability of each character and the number of candidate characters to be predicted, the occurrence probability of each candidate character.
Optionally, in this embodiment, the storage medium may include, but is not limited to, various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk or an optical disc.
An embodiment of the present invention further provides a processor configured to run a program, where the program, when run, performs the steps of any one of the methods described above.
Optionally, in this embodiment, the above program is used to perform the following steps:
S1, acquiring character information from a training data set;
S2, converting the character information into a test set to be recognized using character vectors of a preset dimension;
S3, generating a natural language sentence library by training the test set to be recognized in a recurrent neural network (RNN) model.
Optionally, in this embodiment, the above program is used to perform the following steps, where acquiring the character information from the training data set includes:
S1, counting the occurrence frequency of each character in the training data set, where the characters include at least one of the following: words, digits, symbols;
S2, sorting the characters whose occurrence frequency is greater than a preset threshold in a preset order to obtain the character information.
Optionally, in this embodiment, the above program is used to perform the following step:
S1, the RNN model includes an input layer, a hidden layer and an output layer, where the input layer is adjacent to the hidden layer, and the hidden layer is adjacent to the output layer.
Optionally, in this embodiment, the above program is used to perform the following steps, where generating the natural language sentence library by training the test set to be recognized in the RNN model includes:
S1, extracting, from the RNN model parameters configured for the RNN model, the number of hidden layers, the number of neurons in each hidden layer and the training data truncation length;
S2, computing the number of neurons in the input layer from the training data truncation length and the preset dimension;
S3, setting the number of neurons in the output layer according to the number of characters contained in the character information;
S4, training each character in the test set to be recognized according to the number of hidden layers, the number of neurons in each hidden layer, the number of neurons in the input layer and the number of neurons in the output layer, so as to generate the natural language sentence library.
Optionally, in this embodiment, the above program is used to perform one of the following steps after the natural language sentence library is generated by training the test set to be recognized in the RNN model:
S1, verifying, using the natural language sentence library, whether a currently received sentence is an abnormal sentence;
S2, predicting, using the natural language sentence library, the characters following the currently received sentence.
Optionally, in this embodiment, the above program is used to perform the following steps, where verifying, using the natural language sentence library, whether the currently received sentence is the abnormal sentence includes:
S1, determining the number of characters contained in the currently received sentence and the verification direction of the currently received sentence;
S2, computing, in the natural language sentence library and according to the verification direction, the probability of each character contained in the currently received sentence;
S3, computing, from the probability of each character, the probability that the currently received sentence is a normal sentence.
Optionally, in this embodiment, the above program is used to perform the following steps, where predicting, using the natural language sentence library, the characters following the currently received sentence includes:
S1, determining the number of characters contained in the currently received sentence, the verification direction of the currently received sentence and the number of candidate characters to be predicted;
S2, computing, in the natural language sentence library and according to the verification direction, the probability of each character contained in the currently received sentence;
S3, computing, from the probability of each character and the number of candidate characters to be predicted, the occurrence probability of each candidate character.
Optionally, for specific examples in this embodiment, reference may be made to the examples described in the above embodiments and optional implementations, which are not repeated here.
Obviously, those skilled in the art should understand that the modules or steps of the above embodiments of the present invention can be implemented by a general-purpose computing device; they can be concentrated on a single computing device or distributed over a network formed by multiple computing devices; optionally, they can be implemented by program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; in some cases, the steps shown or described can be performed in an order different from the one given here; alternatively, they can each be made into an individual integrated circuit module, or multiple of the modules or steps can be made into a single integrated circuit module. Thus, the embodiments of the present invention are not limited to any specific combination of hardware and software.
The above are only preferred embodiments of the present invention and are not intended to limit it; for those skilled in the art, various modifications and changes are possible. Any modification, equivalent replacement, improvement and the like made within the principles of the embodiments of the present invention shall fall within their protection scope.
Industrial Applicability
In the embodiments of the present invention, character information is acquired from a training data set, the character information is converted into a test set to be recognized using character vectors of a preset dimension, and a natural language sentence library is generated by training the test set to be recognized in an RNN model. This achieves a high recognition rate and ease of use, and can satisfy the needs of NLP services such as question answering systems, retrieval systems, expert systems, online customer service, mobile phone assistants and personal assistants.

Claims (15)

  1. A method for generating a natural language sentence library, comprising:
    acquiring character information from a training data set;
    converting the character information into a test set to be recognized using character vectors of a preset dimension;
    generating a natural language sentence library by training the test set to be recognized in a recurrent neural network (RNN) model.
  2. The method according to claim 1, wherein acquiring the character information from the training data set comprises:
    counting the occurrence frequency of each character in the training data set, wherein the characters comprise at least one of the following: words, digits, symbols;
    sorting the characters whose occurrence frequency is greater than a preset threshold in a preset order to obtain the character information.
  3. The method according to claim 2, wherein the RNN model comprises: an input layer, a hidden layer and an output layer, wherein the input layer is adjacent to the hidden layer, and the hidden layer is adjacent to the output layer.
  4. The method according to claim 3, wherein generating the natural language sentence library by training the test set to be recognized in the RNN model comprises:
    extracting, from RNN model parameters configured for the RNN model, the number of hidden layers, the number of neurons in each hidden layer and a training data truncation length;
    computing the number of neurons in the input layer from the training data truncation length and the preset dimension;
    setting the number of neurons in the output layer according to the number of characters contained in the character information;
    training each character in the test set to be recognized according to the number of hidden layers, the number of neurons in each hidden layer, the number of neurons in the input layer and the number of neurons in the output layer, to generate the natural language sentence library.
  5. The method according to claim 4, further comprising one of the following after the natural language sentence library is generated by training the test set to be recognized in the RNN model:
    verifying, by using the natural language sentence library, whether a currently received sentence is an abnormal sentence;
    predicting, by using the natural language sentence library, characters following the currently received sentence.
  6. The method according to claim 5, wherein verifying, by using the natural language sentence library, whether the currently received sentence is the abnormal sentence comprises:
    determining the number of characters contained in the currently received sentence and a verification direction of the currently received sentence;
    computing, in the natural language sentence library and according to the verification direction, the probability of each character contained in the currently received sentence;
    computing, from the probability of each character, the probability that the currently received sentence is a normal sentence.
  7. The method according to claim 5, wherein predicting, by using the natural language sentence library, the characters following the currently received sentence comprises:
    determining the number of characters contained in the currently received sentence, the verification direction of the currently received sentence and the number of candidate characters to be predicted;
    computing, in the natural language sentence library and according to the verification direction, the probability of each character contained in the currently received sentence;
    computing, from the probability of each character and the number of candidate characters to be predicted, the occurrence probability of each candidate character.
  8. An apparatus for generating a natural language sentence library, comprising:
    an acquiring module, configured to acquire character information from a training data set;
    a converting module, configured to convert the character information into a test set to be recognized using character vectors of a preset dimension;
    a generating module, configured to generate a natural language sentence library by training the test set to be recognized in a recurrent neural network (RNN) model.
  9. The apparatus according to claim 8, wherein the acquiring module comprises:
    a counting unit, configured to count the occurrence frequency of each character in the training data set, wherein the characters comprise at least one of the following: words, digits, symbols;
    a first acquiring unit, configured to sort the characters whose occurrence frequency is greater than a preset threshold in a preset order to obtain the character information.
  10. The apparatus according to claim 9, wherein the RNN model comprises: an input layer, a hidden layer and an output layer, wherein the input layer is adjacent to the hidden layer, and the hidden layer is adjacent to the output layer.
  11. The apparatus according to claim 10, wherein the generating module comprises:
    an extracting unit, configured to extract, from RNN model parameters configured for the RNN model, the number of hidden layers, the number of neurons in each hidden layer and a training data truncation length;
    a first computing unit, configured to compute the number of neurons in the input layer from the training data truncation length and the preset dimension;
    a setting unit, configured to set the number of neurons in the output layer according to the number of characters contained in the character information;
    a generating unit, configured to train each character in the test set to be recognized according to the number of hidden layers, the number of neurons in each hidden layer, the number of neurons in the input layer and the number of neurons in the output layer, to generate the natural language sentence library.
  12. The apparatus according to claim 11, further comprising:
    a processing module, configured to verify, by using the natural language sentence library, whether a currently received sentence is an abnormal sentence, or to predict, by using the natural language sentence library, characters following the currently received sentence.
  13. The apparatus according to claim 12, wherein the processing module comprises:
    a determining unit, configured to determine the number of characters contained in the currently received sentence and a verification direction of the currently received sentence;
    a second computing unit, configured to compute, in the natural language sentence library and according to the verification direction, the probability of each character contained in the currently received sentence;
    a third computing unit, configured to compute, from the probability of each character, the probability that the currently received sentence is a normal sentence.
  14. The apparatus according to claim 12, wherein the processing module comprises:
    a determining unit, configured to determine the number of characters contained in the currently received sentence, the verification direction of the currently received sentence and the number of candidate characters to be predicted;
    a second computing unit, configured to compute, in the natural language sentence library and according to the verification direction, the probability of each character contained in the currently received sentence;
    a third computing unit, configured to compute, from the probability of each character and the number of candidate characters to be predicted, the occurrence probability of each candidate character.
  15. A storage medium comprising a stored program, wherein the program, when run, performs the method according to any one of claims 1 to 7.