RU2722571C1

RU2722571C1 - Method of recognizing named entities in network text based on elimination of probability ambiguity in neural network

Info

Publication number: RU2722571C1
Application number: RU2019117529A
Authority: RU
Inventors: Ян ЧЖОУ; Бин ЛЮ; Чжаою ХАНЬ; Чжонцю ВАН
Original assignee: China University of Mining and Technology
Priority date: 2017-05-27
Filing date: 2017-06-20
Publication date: 2020-06-01
Also published as: WO2018218705A1; CA3039280A1; CA3039280C; CN107203511B; CN107203511A; AU2017416649A1

Abstract

FIELD: computer equipment.

SUBSTANCE: the invention relates to computer engineering. Disclosed is a method for recognizing named entities in network text based on probability disambiguation in a neural network, involving: performing word segmentation on an unlabeled corpus and using a Word2Vec model to extract word vectors; converting the reference corpora into a word feature matrix and applying windowing; constructing a deep neural network for training; adding a Softmax function to the output layer of the neural network and normalizing to obtain a probability matrix of the named entity categories corresponding to each word; applying windowing again to the probability matrix and using a conditional random field model for disambiguation to obtain the final named entity tag.

EFFECT: enabling recognition of named entities of network text based on elimination of probability ambiguity in a neural network.

7 cl, 3 dwg

Description

FIELD OF TECHNOLOGY

The present invention relates to the processing and analysis of network text, and in particular to a method for recognizing named entities in network text based on probability disambiguation in a neural network.

BACKGROUND

Networks have raised the speed and scale of information collection and dissemination to an unprecedented level, have made global dissemination and sharing of information a reality, and have become indispensable infrastructure in the information society. Modern communication and dissemination technologies have greatly increased the speed and reach of information. However, there are accompanying problems and "side effects": people are sometimes lost in chaotic information, and it can be very difficult to extract the specific information required from a huge volume of information quickly and accurately. This motivates analyzing a mass of network text and extracting the named entities that interest Internet users, such as people, places and organizations, in order to provide important reference information for various higher-level applications, such as Internet marketing and group sentiment analysis. Accordingly, named entity recognition for network text has become an important basic technology for processing and analyzing network data.

Research considers two approaches to named entity recognition, namely rule-based methods and statistics-based methods. As machine learning theory continues to mature and computing speed improves considerably, statistics-based methods are increasingly preferred.

At present, the statistical models and methods applied to named entity recognition mainly include the hidden Markov model, the decision tree, the maximum entropy model, the support vector machine, the conditional random field, and the artificial neural network. An artificial neural network can achieve better named entity recognition results than the conditional random field, the maximum entropy model and other models, but the conditional random field and maximum entropy models remain the dominant models in practice. For example, Patent Document No. CN 201310182978.X proposes a named entity recognition method and device for microblog text based on a conditional random field and a named entity library. Patent Document No. CN 200710098635.X proposes a named entity recognition method that uses word features and applies a maximum entropy model for modeling. An artificial neural network is difficult to use in practice because, in named entity recognition, it usually requires words to be converted into vectors in a word vector space. Consequently, an artificial neural network cannot be used in large-scale practical applications, because it cannot obtain the corresponding vectors for new words.

Given this situation, named entity recognition for network text faces two main problems. First, it is impossible to train a word vector space that contains all words in order to train a neural network, because network text contains many network words, new words, and misspelled or garbled characters. Second, the accuracy of named entity recognition for network text is degraded by phenomena found in such text, such as arbitrary language forms, non-standard grammatical constructions, and misspelled or garbled characters.

SUMMARY OF THE INVENTION

The object of the invention is to overcome the drawbacks of the prior art. The present invention provides a method for recognizing named entities in network text based on probability disambiguation in a neural network, which extracts word features incrementally without retraining the neural network and performs recognition by means of probability disambiguation. The method obtains, by training a neural network, a probability prediction matrix for the named entity category of each word, and performs disambiguation in a probabilistic model on the prediction matrix output by the neural network, thereby improving the precision and accuracy of named entity recognition for network text.

Technical solution: to achieve the object described above, the technical solution adopted by the present invention is as follows:

A method for recognizing named entities in network text based on probability disambiguation in a neural network performs word segmentation on an unlabeled corpus and uses Word2Vec to extract word vectors; converts the reference corpora into a word feature matrix and applies windowing; constructs a deep neural network for training; adds a Softmax function to the output layer of the neural network and normalizes to obtain a probability matrix of the named entity categories corresponding to each word; and applies windowing again to the probability matrix and uses a conditional random field model for disambiguation to obtain the final named entity tag.

In particular, the method includes the following steps:

step 1: obtaining an unlabeled corpus by means of a web crawler, obtaining reference corpora with named entity tags from a corpus database, and performing word segmentation on the unlabeled corpus using a natural language processing tool;

step 2: training a word vector space on the segmented unlabeled corpus and the reference corpora using the Word2Vec tool;

step 3: converting the text in the reference corpora into word vectors that represent word features according to the trained Word2Vec (word embedding) model, applying windowing to the word vectors, and using the two-dimensional matrix of size w × d, where w is the window size and d is the word vector length, as the input to the neural network; converting the tags in the reference corpora into one-hot form and using them as the output of the neural network; normalizing the output layer of the neural network with the Softmax function (multinomial logistic function), so that the classification result output by the neural network corresponds to the probability that a word belongs to the non-entity category or to a named entity category; adjusting the structure, depth, number of nodes, step size, activation function and initial-value parameters of the neural network, and selecting an activation function for training the neural network;

step 4: applying windowing again to the prediction matrix output by the neural network, using the predicted information for the context of the word to be tagged as a correlation point with the actual category of the word to be tagged in the conditional random field model, using the expectation-maximization (EM) algorithm to compute the expected values over all edges according to the training corpora, and training the corresponding conditional random field model;

step 5: during recognition, first converting the text to be recognized into word vectors that represent word features according to the trained Word2Vec model and, if the Word2Vec model does not contain a given word, converting that word into a word vector by means of incremental learning, extraction of the word vector, rollback of the word vector space, and so on; applying windowing to the word vectors and using the two-dimensional w × d matrix, where w is the window size and d is the word vector length, as the input to the neural network; then applying windowing again to the prediction matrix obtained from the neural network, performing disambiguation on the prediction matrix in the trained conditional random field model, and obtaining the final named entity tags of the text to be recognized.

Preferably, the parameters of the Word2Vec tool are as follows: word vector length: 200; number of iterations: 25; initial step size: 0.025; minimum step size: 0.0001; and the CBOW model is selected.

Preferably, the parameters of the neural network are as follows: number of hidden layers: 2; number of hidden nodes: 150; step size: 0.01; batch size (batchSize): 40; activation function: sigmoid.

Preferably, the tags in the reference corpora are converted into one-hot form as follows: the tags "/o", "/n" and "/p" in the reference corpora are converted into the named entity tags "/Org-B", "/Org-I", "/Per-B", "/Per-I", "/Loc-B" and "/Loc-I", respectively, and the named entity tags are then converted into one-hot form.
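The tag conversion above can be illustrated with a minimal sketch. The text does not say how a "-B" tag is distinguished from an "-I" tag, so the rule used here (a run of identical corpus tags forms one entity, with "-B" on its first word) is an assumption, as are the "O" label for non-entity words, the example tag "/v", and all function names.

```python
# Hypothetical label set: "O" (not an entity) plus the six tags named in the text.
LABELS = ["O", "Org-B", "Org-I", "Per-B", "Per-I", "Loc-B", "Loc-I"]

# Assumed mapping from corpus tags to entity types, following the order
# in which the text lists them ("/o", "/n", "/p" -> Org, Per, Loc).
TAG_TO_TYPE = {"/o": "Org", "/n": "Per", "/p": "Loc"}


def to_entity_labels(corpus_tags):
    """Map per-word corpus tags to B/I entity labels.

    Assumption: consecutive words with the same corpus tag form one entity;
    the first word gets "-B" and the rest get "-I". Any other tag becomes "O".
    """
    labels = []
    previous = None
    for tag in corpus_tags:
        entity_type = TAG_TO_TYPE.get(tag)
        if entity_type is None:
            labels.append("O")
        elif tag == previous:
            labels.append(entity_type + "-I")
        else:
            labels.append(entity_type + "-B")
        previous = tag
    return labels


def one_hot(label):
    """Return a vector with a single 1 at the index of the label."""
    vector = [0] * len(LABELS)
    vector[LABELS.index(label)] = 1
    return vector


tags = ["/n", "/n", "/v", "/p"]
labels = to_entity_labels(tags)
targets = [one_hot(label) for label in labels]
```

Each row of `targets` would then serve as the expected network output for one word.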

Preferably, the window size used when applying windowing to the word vectors is 5.

Preferably, when training the neural network, one tenth of the words are set aside from the reference data and excluded from training, and are instead used to evaluate the neural network.
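A minimal sketch of such a hold-out split follows. The text does not state how the tenth is chosen; a seeded random selection is assumed here, and the function name and sample data are hypothetical.

```python
import random


def hold_out_split(samples, fraction=0.1, seed=0):
    """Set aside `fraction` of the samples for evaluation only.

    Assumption: the held-out samples are picked at random.
    """
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_held_out = int(len(samples) * fraction)
    held_out = set(indices[:n_held_out])
    train = [s for i, s in enumerate(samples) if i not in held_out]
    evaluation = [s for i, s in enumerate(samples) if i in held_out]
    return train, evaluation


words = ["w%d" % i for i in range(50)]
train_words, eval_words = hold_out_split(words)
```

With 50 samples, 5 are held out and 45 remain for training.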

Compared with the prior art, the present invention provides the following beneficial effects:

Word vectors can be extracted incrementally without retraining the neural network, prediction can be performed by the neural network, and disambiguation can be performed by a probabilistic model, so that the method achieves better feasibility, precision and accuracy in named entity recognition for network text. For the task of named entity recognition in network text, the present invention provides a method for incrementally learning word vectors without changing the structure of the neural network, which addresses the presence of network words and new words, and uses a probability disambiguation model to address the problems that network text has non-standard grammatical constructions and contains many misspelled or garbled characters. The method provided by the present invention therefore achieves high accuracy in named entity recognition for network text.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flowchart of training the device for recognizing named entities in network text based on probability disambiguation in a neural network according to the present invention;

FIG. 2 is a flowchart of converting a word into word features according to the present invention;

FIG. 3 is a schematic diagram of text processing and the neural network architecture according to the present invention.

DETAILED DESCRIPTION OF THE INVENTION

The present invention is described in further detail below by way of embodiments and with reference to the accompanying drawings. It should be understood that these embodiments are presented only to describe the present invention and should not be construed as limiting its scope. Modifications of the present invention in various equivalent forms made by those skilled in the art after reading this disclosure fall within the scope of protection defined by the appended claims of this application.

A method for recognizing named entities in network text based on probability disambiguation in a neural network performs word segmentation on an unlabeled corpus and uses a Word2Vec model to extract word vectors; converts the reference corpora into a word feature matrix and applies windowing; constructs a deep neural network for training; adds a Softmax function to the output layer of the neural network and normalizes to obtain a probability matrix of the named entity categories corresponding to each word; and applies windowing again to the probability matrix and uses a conditional random field model for disambiguation to obtain the final named entity tag.

step 1: obtaining an unlabeled corpus by means of a web crawler, downloading corpora with named entity tags from a corpus database as the reference corpora, and performing word segmentation on the unlabeled corpus using a natural language processing tool;

step 2: training a word vector space on the segmented unlabeled corpus and the reference corpora using the Word2Vec tool;

step 3: converting the text in the reference corpora into word vectors that represent word features according to the trained Word2Vec model, and using the word vectors as the input to the neural network; converting the tags in the reference corpora into one-hot form and using them as the output of the neural network. Because a named entity may be split into several words in a text processing task, tagging follows the IOB scheme to ensure that each recognized named entity is complete.

The named entity category to which a word belongs should be judged not only from the word itself but also from its context. Therefore, the concept of a "window" is introduced when constructing the neural network: when a word is evaluated, both the word and the feature information of a fixed-length context around it are taken as the input to the neural network. The input to the neural network is thus no longer a word feature vector of length d, but a two-dimensional matrix of size w × d, where w is the window size and d is the length of the word feature vector.
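The window construction can be sketched as follows. The text does not say how positions beyond the sentence boundary are handled or whether the window is centred on the word; zero-padding and a centred, odd-sized window are assumptions of this sketch, and the function name is hypothetical.

```python
def window_matrix(vectors, index, w=5):
    """Return the w x d matrix for the word at `index` and its neighbours.

    Assumptions: w is odd, the window is centred on the word, and positions
    that fall outside the sentence are filled with zero vectors.
    """
    if w % 2 == 0:
        raise ValueError("this sketch expects an odd window size")
    d = len(vectors[0])
    half = w // 2
    rows = []
    for position in range(index - half, index + half + 1):
        if 0 <= position < len(vectors):
            rows.append(list(vectors[position]))
        else:
            rows.append([0.0] * d)
    return rows


# Toy sentence of three words with d = 2 (the text prefers d = 200).
sentence = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
matrix = window_matrix(sentence, index=0, w=5)
```

For the first word of the toy sentence, the first two rows of `matrix` are zero padding and the third row is the word's own vector.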

The output layer of the neural network is normalized with the Softmax function, so that the classification result output by the neural network corresponds to the probability that a word belongs to the non-entity category or to a named entity category. The structure, depth, number of nodes, step size, activation function and initial-value parameters of the neural network are adjusted, and an activation function is selected for training the neural network.
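For illustration, the Softmax normalization maps the raw scores of the output layer to positive values that sum to 1. The sketch below uses the standard definition; subtracting the maximum score is a common numerical-stability measure and is not taken from the text.

```python
import math


def softmax(scores):
    """Normalise raw output-layer scores into probabilities that sum to 1."""
    highest = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax([2.0, 1.0, 0.1])
```

Applying `softmax` to the output scores of every word yields one row of the probability matrix per word.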

step 4: applying windowing again to the prediction matrix output by the neural network, using the predicted information for the context of the word to be tagged as a correlation point with the actual category of the word to be tagged in the conditional random field model, using the expectation-maximization (EM) algorithm to compute the expected values over all edges according to the training corpora, and training the corresponding conditional random field model;
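One possible reading of the second windowing pass is sketched below: for each word, the predicted probabilities of the word and of its neighbours are collected as features for the conditional random field. The feature naming scheme, the skipping of out-of-range positions and the function name are assumptions of this sketch, not details given in the text.

```python
def context_features(prob_matrix, index, labels, w=5):
    """Collect the predicted probabilities of a word and its neighbours.

    Each feature is named "<offset>:<label>". Positions outside the
    sentence are skipped (an assumption of this sketch).
    """
    half = w // 2
    features = {}
    for offset in range(-half, half + 1):
        position = index + offset
        if 0 <= position < len(prob_matrix):
            for label, probability in zip(labels, prob_matrix[position]):
                features["%+d:%s" % (offset, label)] = probability
    return features


labels = ["O", "Per-B", "Per-I"]
prob_matrix = [[0.2, 0.7, 0.1], [0.3, 0.1, 0.6], [0.9, 0.05, 0.05]]
features = context_features(prob_matrix, index=1, labels=labels, w=3)
```

Here `features["-1:Per-B"]` carries the network's belief that the preceding word starts a person name, which is the kind of contextual evidence the conditional random field can correlate with the category of the current word.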

step 5: during recognition, first converting the text to be recognized into word vectors that represent word features according to the trained Word2Vec model and, if the Word2Vec model does not contain a given word, converting that word into a word vector by means of incremental learning, extraction of the word vector, rollback of the word vector space, and so on:

(1) the word to be converted is matched against the word vector space;

(2) if a match for the word is found in the word vector space, the word is converted directly into the corresponding word vector;

(3) if the Word2Vec model does not contain the word, the word vector space is backed up, to prevent the loss of neural network accuracy that drift of the word vector space during incremental training would cause; the Word2Vec model is loaded; the sentence containing the unmatched word is obtained and fed into the Word2Vec model for incremental training; the word vector of the word is obtained; and the backed-up word vector space is used to roll the model back;
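Sub-steps (1) to (3) can be illustrated with a toy model. The sketch below is not the gensim API: `ToyEmbeddingModel`, `incremental_train` and `lookup_with_rollback` are hypothetical names, and the "training" step is simulated with random numbers purely to show the backup, train, extract and roll back sequence.

```python
import copy
import random

class ToyEmbeddingModel:
    """Hypothetical stand-in for a trained Word2Vec model (not the gensim API)."""

    def __init__(self, vectors, dim):
        self.vectors = vectors  # maps word -> list of floats
        self.dim = dim

    def incremental_train(self, sentence, rng):
        # Simulated incremental step: new words receive a vector, and existing
        # vectors drift slightly, which is the side effect to be undone.
        for word in sentence:
            if word in self.vectors:
                self.vectors[word] = [v + rng.uniform(-0.01, 0.01)
                                      for v in self.vectors[word]]
            else:
                self.vectors[word] = [rng.uniform(-0.5, 0.5)
                                      for _ in range(self.dim)]

def lookup_with_rollback(model, word, sentence, rng):
    """Return a vector for `word`; `sentence` must contain `word`."""
    # (1)-(2): a word already in the vector space is converted directly.
    if word in model.vectors:
        return model.vectors[word]
    # (3): back up the space, train on the sentence, read the new vector ...
    backup = copy.deepcopy(model.vectors)
    model.incremental_train(sentence, rng)
    new_vector = model.vectors[word]
    # ... then roll back so previously learned vectors are left unchanged.
    model.vectors = backup
    return new_vector

rng = random.Random(0)
model = ToyEmbeddingModel({"china": [0.1, 0.2], "university": [0.3, 0.4]}, dim=2)
before = copy.deepcopy(model.vectors)
vec = lookup_with_rollback(model, "sogou", ["sogou", "university"], rng)
```

After the call, the model holds exactly the vectors it held before, so a neural network trained on the original vector space still sees the inputs it was trained on.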

Window processing is then applied to the word vectors, and the resulting two-dimensional matrix of size w × d, where w is the window size and d is the word vector length, is used as input to the neural network. Window processing is applied again to the prediction matrix obtained from the neural network, disambiguation is performed on the prediction matrix with the trained conditional random field model, and the final named entity tags of the text to be recognized are obtained.
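Window processing can be sketched as follows. The patent does not say how words at the edges of a sentence are handled, so the zero padding here is an assumption, as are the function name and the toy 2-dimensional vectors.

```python
def window_matrices(vectors, window=5):
    """Return one window-by-d matrix per word, zero-padded at the sentence edges."""
    if window % 2 == 0:
        raise ValueError("window must be odd so the current word is centered")
    if not vectors:
        return []
    dim = len(vectors[0])
    half = window // 2
    # Assumed edge handling: pad with zero vectors on both sides.
    padding = [[0.0] * dim for _ in range(half)]
    padded = padding + list(vectors) + padding
    return [padded[i:i + window] for i in range(len(vectors))]

# Three words with hypothetical 2-dimensional vectors.
sentence_vectors = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
matrices = window_matrices(sentence_vectors, window=5)
```

The same function can be applied to the rows of the probability matrix output by the network, which is how the second round of window processing could be produced for the conditional random field stage.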

Example

Web text was collected with a web crawler from the Sogou News website (http://news.sogou.com/), and corpora with named entity tags were downloaded from the Datatang corpus database (http://www.datatang.com/) as reference corpora. The collected web text was segmented into words with a natural language tool. The word vector space was trained on the segmented web corpus and the reference corpora with the Word2Vec model from the Python gensim package, using the following parameters: word vector length: 200; number of iterations: 25; initial step length: 0.025; minimum step length: 0.0001; the CBOW model was selected.
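One way these parameters might map onto gensim is shown below. The keyword names are those of gensim 4.x (`vector_size`, `epochs`); older gensim releases used `size` and `iter`. The helper name `train_word2vec` is an assumption, and parameters the patent does not mention (such as `window` and `min_count`) are left at the library defaults.

```python
# Keyword arguments for gensim.models.Word2Vec (gensim 4.x names) that mirror
# the parameters listed in the example.
W2V_PARAMS = {
    "vector_size": 200,   # word vector length
    "epochs": 25,         # number of iterations
    "alpha": 0.025,       # initial step length (learning rate)
    "min_alpha": 0.0001,  # minimum step length
    "sg": 0,              # 0 selects CBOW; 1 would select skip-gram
}

def train_word2vec(sentences):
    """Train on an iterable of token lists; requires gensim to be installed."""
    # Imported lazily so the configuration can be loaded without gensim.
    from gensim.models import Word2Vec
    return Word2Vec(sentences=sentences, **W2V_PARAMS)
```

Note that gensim's default `min_count` of 5 discards rare words, so a small test corpus would need that value lowered explicitly.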

The text in the reference corpora is converted into word vectors representing word features according to the trained Word2Vec model; if the Word2Vec model does not contain a word, the word is converted into a word vector by incremental training, word vector extraction and rollback of the word vector space, etc., and that vector is used as the word's features. The tags "/o", "/n" and "/p" in the reference corpora obtained from Datatang are converted into the named entity tags "/Org-B", "/Org-I", "/Per-B", "/Per-I", "/Loc-B", "/Loc-I", etc., respectively, and the named entity tags are then converted into one-hot form to serve as the output of the neural network.
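The tag conversion can be sketched as below, under several assumptions: the output encoding is taken to be one-hot, "/o", "/n" and "/p" are taken to denote organization, person and location in that order, the first token of a run of identical tags is marked "-B" and the rest "-I", and an extra label "O" is added for words that are not entities. The function names are hypothetical.

```python
# Assumed mapping from the coarse corpus tags to entity types, following the
# order given in the text; "O" (not an entity) is an added label.
COARSE_TO_TYPE = {"/o": "Org", "/n": "Per", "/p": "Loc"}
LABELS = ["O", "Org-B", "Org-I", "Per-B", "Per-I", "Loc-B", "Loc-I"]

def to_bio(coarse_tags):
    """Mark the first token of each run as -B and the rest of the run as -I."""
    labels, previous = [], None
    for tag in coarse_tags:
        entity = COARSE_TO_TYPE.get(tag)
        if entity is None:
            labels.append("O")
        elif tag == previous:
            labels.append(entity + "-I")
        else:
            labels.append(entity + "-B")
        previous = tag
    return labels

def one_hot(label):
    """Encode a label as a vector with a single 1 at the label's index."""
    return [1 if label == name else 0 for name in LABELS]

bio = to_bio(["/o", "/o", "/x", "/n", "/p", "/p"])
targets = [one_hot(label) for label in bio]
```

A limitation of this simple rule is that two distinct entities of the same type that are directly adjacent would be merged into one; the source corpus would need explicit boundaries to avoid that.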

The window size is set to 5: when the named entity category of the current word is considered, the features of that word, of the two words before it and of the two words after it are used as input to the neural network. The input to the neural network therefore has size batch size × 1000 (a window of 5 words × 200 dimensions per word vector). One tenth of the words are extracted from the reference data and excluded from training, and are used instead as the evaluation set for the neural network. The output layer of the neural network is normalized with the Softmax function, so that the categorization result output by the network is the probability that a word is a non-entity or belongs to a named entity category; at this stage, the category with the maximum probability is provisionally taken as the final categorization result. Parameters of the neural network, such as structure, depth, number of nodes, step length, activation function and initial values, are tuned so that the network achieves high accuracy. The final parameters are: number of hidden layers: 2; number of hidden nodes: 150; step length: 0.01; batch size: 40; activation function: sigmoid. With these settings a good categorization result is obtained: accuracy can reach 99.83%, and the F-scores for personal names, place names and organization names, the most typical entity types, can reach 93.4%, 84.2% and 80.4%, respectively.
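The one-tenth hold-out can be sketched as follows. The patent does not say how the held-out words are chosen, so the seeded random shuffle is an assumption, as is the function name.

```python
import random

def hold_out_split(samples, fraction=0.1, seed=0):
    """Shuffle samples and set aside `fraction` of them for evaluation only."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_eval = round(len(shuffled) * fraction)
    # The first n_eval samples are never shown to the network during training.
    return shuffled[n_eval:], shuffled[:n_eval]

# 1000 hypothetical sample indices.
train, evaluation = hold_out_split(range(1000))
```

Fixing the seed keeps the evaluation set identical across runs, so accuracy figures obtained with different network parameters remain comparable.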

The step of taking the maximum value of the prediction matrix output by the neural network as the final categorization result is then removed. Instead, window processing is applied directly to the probability matrix, and the predicted information for the context of the word to be tagged is used as the feature correlated with the actual category of that word in the conditional random field model. The expectation-maximization (EM) algorithm is used to calculate the expected values on all edges of the conditional random field from the training corpora, and the corresponding conditional random field model is trained. After disambiguation with the conditional random field, the F-scores for personal names, place names and organization names can be improved to 94.8%, 85.0% and 82.0%, respectively.
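The sketch below illustrates why replacing the per-word maximum with sequence-level decoding can help. It does not train a conditional random field and is not the patent's procedure: it uses Viterbi decoding, the standard inference method for linear-chain models, with a single hand-set transition rule. The label set, the rule and the probabilities are all hypothetical.

```python
import math

LABELS = ["O", "Per-B", "Per-I"]

def transition_score(prev, cur):
    # Hypothetical hand-set rule: an "-I" tag may only continue the same entity.
    if cur.endswith("-I") and prev[:-2] != cur[:-2]:
        return -1e9
    return 0.0

def viterbi(prob_matrix):
    """Return the highest-scoring label sequence; all probabilities must be > 0."""
    # score[j] is the best log-score of any path ending in LABELS[j];
    # the position before the first word is treated as "O".
    score = [math.log(p) + transition_score("O", lab)
             for p, lab in zip(prob_matrix[0], LABELS)]
    back = []
    for row in prob_matrix[1:]:
        new_score, pointers = [], []
        for j, cur in enumerate(LABELS):
            best_i = max(range(len(LABELS)),
                         key=lambda i: score[i] + transition_score(LABELS[i], cur))
            pointers.append(best_i)
            new_score.append(score[best_i]
                             + transition_score(LABELS[best_i], cur)
                             + math.log(row[j]))
        score = new_score
        back.append(pointers)
    # Trace back from the best final label.
    j = max(range(len(LABELS)), key=lambda i: score[i])
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    return [LABELS[j] for j in reversed(path)]

# Hypothetical Softmax outputs for a two-word sentence.
probs = [[0.50, 0.45, 0.05],
         [0.10, 0.20, 0.70]]
per_word = [LABELS[row.index(max(row))] for row in probs]
decoded = viterbi(probs)
```

Here the per-word maximum yields "O" followed by "Per-I", which is not a valid tag sequence, while decoding over the whole sequence yields "Per-B" followed by "Per-I". A trained conditional random field would learn such dependencies from the corpora rather than rely on a hand-written rule.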

The embodiment described above shows that, compared with conventional supervised methods of named entity recognition, the method for recognizing named entities in text based on probability disambiguation in a neural network according to the present invention uses a word vector conversion method that can extract word features incrementally without causing drift of the word vector space. The neural network can therefore be applied to network text that contains many new words and misspelled or garbled characters. In addition, the present invention applies window processing again to the probability matrix output by the neural network and performs context disambiguation with a conditional random field model, which addresses the problem that network text contains many misspelled or garbled characters and non-standard grammatical constructions.

Although the present invention has been described above in terms of some preferred embodiments, it should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications should be regarded as falling within the scope of protection of the present invention.

Claims

1. A method for recognizing named entities of network text based on the elimination of probability ambiguity in a neural network, comprising: performing word segmentation on an unlabeled corpus and using a Word2Vec model to extract word vectors; converting reference corpora into a word feature matrix; performing window processing; building a deep neural network for training; adding a Softmax function to the output layer of the neural network and performing normalization to obtain a probability matrix of the named entity categories corresponding to each word; and performing window processing again on the probability matrix and using a conditional random field model for disambiguation to obtain the final named entity tag.

2. A method for recognizing named entities of network text based on the elimination of the ambiguity of probability in a neural network according to claim 1, comprising the following steps:

step 1: obtaining an unlabeled corpus with a web crawler, obtaining reference corpora with named entity tags from a corpus database, and performing word segmentation on the unlabeled corpus with a natural language tool;

step 2: training the word vector space on the segmented unlabeled corpus and the reference corpora with the Word2Vec tool;

step 3: converting the text in the reference corpora into word vectors representing word features according to the trained Word2Vec model; performing window processing on the word vectors and using the two-dimensional matrix of size w × d, where w is the window size and d is the word vector length, as input to the neural network; converting the tags in the reference corpora into one-hot form and using them as the output of the neural network; normalizing the output layer of the neural network with the Softmax function, so that the categorization result output by the neural network is the probability that a word is a non-entity or belongs to a named entity category; and tuning the structure, depth, number of nodes, step length, activation function and initial-value parameters of the neural network and selecting an activation function to train the neural network;

step 4: performing window processing again on the prediction matrix output by the neural network; using the predicted information for the context of the word to be tagged as the feature correlated with the actual category of that word in the conditional random field model; using the expectation-maximization algorithm to calculate the expected values on all edges from the training corpora; and training the corresponding conditional random field model;

step 5: during recognition, first converting the text to be recognized into word vectors representing word features according to the trained Word2Vec model and, if the Word2Vec model does not contain a word, converting the word into a word vector by incremental training, word vector extraction and rollback of the word vector space, etc.; performing window processing on the word vectors and using the two-dimensional matrix of size w × d, where w is the window size and d is the word vector length, as input to the neural network; then performing window processing again on the prediction matrix obtained from the neural network, performing disambiguation on the prediction matrix with the trained conditional random field model, and obtaining the final named entity tag of the text to be recognized.

3. A method for recognizing named entities of network text based on the elimination of probability ambiguity in a neural network according to claim 1, wherein the parameters of the Word2Vec tool are as follows: word vector length: 200, number of iterations: 25, initial step length: 0.025, minimum step length: 0.0001, and the CBOW model is selected.

4. A method for recognizing named entities of network text based on the elimination of probability ambiguity in a neural network according to claim 1, wherein the parameters of the neural network are as follows: number of hidden layers: 2, number of hidden nodes: 150, step length: 0.01, batch size: 40, activation function: sigmoid function.

5. A method for recognizing named entities of network text based on the elimination of probability ambiguity in a neural network according to claim 1, wherein the tags in the reference corpora are converted into one-hot form as follows: the tags "/o", "/n" and "/p" in the reference corpora are converted into the named entity tags "/Org-B", "/Org-I", "/Per-B", "/Per-I", "/Loc-B" and "/Loc-I", respectively, and the named entity tags are then converted into one-hot form.

6. A method for recognizing named entities of network text based on the elimination of probability ambiguity in a neural network according to claim 1, wherein the window size used for window processing of the word vectors is 5.

7. A method for recognizing named entities of network text based on the elimination of probability ambiguity in a neural network according to claim 1, wherein, when the neural network is trained, one tenth of the words are extracted from the reference data and excluded from the training of the neural network, and are used instead as the evaluation set for the neural network.