CN113128216A - Language identification method, system and device - Google Patents


Info

Publication number
CN113128216A
Authority
CN
China
Prior art keywords
participle
index
standard
information
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911408163.2A
Other languages
Chinese (zh)
Other versions
CN113128216B (en)
Inventor
邓千
刚周伟
郭麟
陈田川
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Group Guizhou Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Group Guizhou Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and China Mobile Group Guizhou Co Ltd
Priority to CN201911408163.2A (patent CN113128216B)
Publication of CN113128216A
Application granted
Publication of CN113128216B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a language identification method, system and device, and belongs to the technical field of identification. In the language identification method, a standard vocabulary library storing standard index names is established for a technical field. After the index class description of a sentence is acquired, it is determined whether the standard vocabulary library contains a fully matching standard index name. If so, the standard index name is taken as the index class information of the sentence, which improves language identification in the technical field. If not, word segmentation is performed by combining the standard vocabulary library and a conventional vocabulary library to obtain a participle list, and the participle list is analyzed to obtain the index class information.

Description

Language identification method, system and device
Technical Field
The present invention relates to the field of recognition technologies, and in particular, to a method, a system, and an apparatus for language recognition.
Background
Natural language processing is one of the major directions of artificial intelligence technology and is currently used in various industries. Natural language processing may be used for human-machine conversation, and the content of the conversation may be casual chat.
Because casual chat usually has no specific purpose, word segmentation is usually performed based on a conventional vocabulary library. When natural language processing is applied to human-machine conversation in a professional field, the conventional vocabulary library cannot correctly segment descriptions in that field, so the robot cannot correctly recognize professional terms and questions, and its answers do not address the questions asked. Therefore, a language identification method applicable to professional fields is needed.
Disclosure of Invention
In order to solve the above problems, the present invention provides a method, system and device for language identification.
In a first aspect, the present invention provides a method for language identification, comprising:
acquiring an index class description of a sentence;
determining whether the index class description has a fully matching standard index name in a standard vocabulary library;
if yes, taking the standard index name as index class information;
if not, performing word segmentation on the index class description by combining the standard vocabulary library and a conventional vocabulary library to obtain a participle list, analyzing the participle list to obtain a target index name, and taking the target index name as the index class information.
In the above language identification method, performing word segmentation on the index class description by combining the standard vocabulary library and the conventional vocabulary library to obtain the participle list includes:
extracting a first participle from the index class description, wherein the first participle has a corresponding standard vocabulary in the standard vocabulary library;
extracting a second participle from the index class description, wherein the second participle has a corresponding conventional vocabulary in the conventional vocabulary library;
and combining the first participle and the second participle to obtain the participle list.
In the above language identification method, analyzing the participle list to obtain the target index name includes:
searching the standard vocabulary library for the standard index names at least partially corresponding to each participle in the participle list, and generating a participle set for each participle from the standard index names at least partially corresponding to that participle, wherein the elements of the participle set are standard index names in the standard vocabulary library;
taking the intersection of the participle sets of all participles;
if the intersection is a non-empty set, taking the elements of the intersection as target index names;
and if the intersection is an empty set, taking the union of the participle sets of all participles, calculating the similarity between each element in the union and the participle list, and acquiring a target element from the union according to the similarity results as the target index name.
In the above language identification method, calculating the similarity between each element in the union and the participle list includes:
segmenting each element in the union according to the standard vocabularies in the standard vocabulary library to obtain the element participles of each element;
calculating participle similarities in sequence, wherein a participle similarity is the similarity between an element participle of an element and a participle in the participle list;
and calculating the average of all participle similarities of each element as the similarity between that element and the participle list.
In the above language identification method, acquiring the target element from the union according to the similarity results as the target index name includes:
taking the element with the highest similarity in the union as the target element to obtain one target index name; or
taking at least the two elements ranked highest by similarity in the union as at least two target elements to obtain at least two target index names.
Before acquiring the index class description of the sentence, the above language identification method further includes:
extracting the temporal description of the sentence to obtain time information; and/or
extracting the geographic location description of the sentence to obtain geographic location information; and/or
extracting the data operational description of the sentence to obtain data operation information; and/or
removing the stop words of the sentence;
after the index class information is obtained, the method further includes:
identifying the sentence semantics according to the index class information, the time information and/or the geographic location information and/or the data operation information.
In the above language identification method, before acquiring the index class description of the sentence, the method includes: receiving an input sentence to be recognized;
identifying the sentence semantics according to the index class information, the time information and/or the geographic location information and/or the data operation information includes: identifying the semantics of the input sentence according to the index class information, the time information and/or the geographic location information and/or the data operation information;
after the sentence semantics are identified according to the index class information, the time information and/or the geographic location information and/or the data operation information, the method includes:
outputting corresponding answer data for the semantics of the input sentence;
receiving feedback information on whether the user is satisfied with the answer data;
and updating the mapping relation between the input sentence to be recognized and the answer data in a mapping library according to the feedback information.
In a second aspect, the present invention provides a language recognition system comprising: a data receiver and a server; the data receiver collects the language to be recognized;
the server comprises a first port, a memory and a processor;
the first port is used for receiving an input sentence to be recognized;
the processor analyzes the input sentence to be recognized and outputs corresponding answer data;
the memory stores the input sentence to be recognized and the corresponding answer data.
In a third aspect, the present invention provides an apparatus comprising: memory, a processor and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the steps of the method as claimed in any one of the above.
In a fourth aspect, the invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method as defined in any one of the above.
Compared with the prior art, the language identification method provided by the invention establishes, for a technical field, a standard vocabulary library storing standard index names. After the index class description of a sentence is acquired, it is determined whether the standard vocabulary library contains a fully matching standard index name. If so, the standard index name is taken as the index class information of the sentence, which improves language identification in the technical field. If not, word segmentation is performed by combining the standard vocabulary library and the conventional vocabulary library to obtain a participle list, and the participle list is analyzed to obtain the index class information.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a flow chart of a method of language identification in an exemplary embodiment of the invention;
FIG. 2 is a partial flow diagram of a method for language identification in an exemplary embodiment of the invention;
FIG. 3 is a schematic diagram of obtaining time information with the regular tree in FIG. 2;
FIG. 4 is a block diagram of a language identification system in accordance with yet another exemplary embodiment of the present invention.
Reference numerals:
200-a language identification system; 210-a data receiver; 220-server.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the specific embodiments of the present invention and the accompanying drawings. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
FIG. 1 is a flow chart of a language identification method in an exemplary embodiment of the invention. FIG. 2 is a partial flow chart of the language identification method in an exemplary embodiment of the invention. The language identification method of the present invention includes S040, S060, S080, and S100.
S040: Acquire the index class description of the sentence. The vocabulary of a sentence may include temporal descriptions, geographic location descriptions, data operational descriptions, stop words, and index class descriptions, among others. In this application, after the temporal description, the geographic location description, the data operational description and the stop words are removed from the sentence, the remaining vocabulary is the index class description.
S060: Determine whether the index class description has a fully matching standard index name in the standard vocabulary library. The standard vocabulary library pre-stores standard index names for a preset field, which may be the communication field. The standard index names may come from fixed report forms, industry standard terms formulated by 3GPP or CCSA, standard vocabulary sets used in the industry, and the like. Moreover, the standard index names in the standard vocabulary library can be updated as standard terms in the industry are revised and added, so as to improve the language identification capability.
S080: If yes, take the standard index name as the index class information. With the standard index name as the index class information, the index referred to in the sentence can be determined correctly, so that the sentence semantics are identified accurately and the sentence can be answered directly, which improves the question-and-answer accuracy of the robot.
S100: if not, performing word segmentation on the index class description by combining the standard vocabulary library and the conventional vocabulary library to obtain a word segmentation list, analyzing the word segmentation list to obtain a target index name, and taking the target index name as the index class information.
Unlike ordinary word segmentation, which is based on common everyday vocabulary, the word segmentation of the embodiment of the invention is based on both the standard vocabulary library and the conventional vocabulary library, so it is closer to the special requirements of the professional field and improves the accuracy of language identification. For example, when segmenting "three-fused", ordinary segmentation based on everyday expressions divides it into the two words "three" and "fused", whereas segmentation based on the standard vocabulary of the communication field in the embodiment of the present invention may leave "three-fused" undivided or may split out only "fused".
After word segmentation, a word segmentation list consisting of a plurality of word segments is obtained, the word segmentation list is analyzed to obtain a target index name, and the target index name is used as index information, so that the accuracy of language identification can be improved.
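The branch between S080 and S100 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function name `identify_index`, the `fallback` callable standing in for the segmentation and analysis of S100, and the sample library content are hypothetical.

```python
from typing import Callable, Set


def identify_index(description: str,
                   standard_index_names: Set[str],
                   fallback: Callable[[str], str]) -> str:
    """Return index class information for an index class description."""
    # S060/S080: a fully matching standard index name is used directly.
    if description in standard_index_names:
        return description
    # S100: otherwise the description is segmented and analyzed; that work
    # is delegated to `fallback` (a hypothetical stand-in in this sketch).
    return fallback(description)


# Hypothetical library content, reusing index names quoted in this description.
NAMES = {"three-fused HSS/HLR basic capacity", "three-fused HSS/HLR number"}

print(identify_index("three-fused HSS/HLR number", NAMES, lambda d: "<needs S100>"))
print(identify_index("three-fused HSS growth", NAMES, lambda d: "<needs S100>"))
```

Keeping the exact-match test ahead of any segmentation means that sentences already phrased with a standard index name never depend on the quality of the segmenter.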
In S120, segmenting the index class description by combining the standard vocabulary library and the conventional vocabulary library to obtain a participle list specifically includes S121, S122, and S123.
S121: Extract a first participle from the index class description, wherein the first participle has a corresponding standard vocabulary in the standard vocabulary library. A standard vocabulary in the standard vocabulary library is a unit smaller than a standard index name, and a standard index name may include one, two or more standard vocabularies. For example, the standard index name "three-fused HSS/HLR basic capacity" includes the standard vocabularies "three-fused", "HSS", "HLR", and "basic capacity". If some vocabulary in the index class description is a standard vocabulary, it is extracted and marked as a first participle.
S122: Extract a second participle from the index class description, wherein the second participle has a corresponding conventional vocabulary in the conventional vocabulary library. After the first participles are extracted, if part of the vocabulary in the index class description corresponds to conventional vocabularies in the conventional vocabulary library, that part is extracted and marked as second participles. The conventional vocabularies in the conventional vocabulary library may be pre-stored and may be generated from everyday vocabulary. In addition, the first participles and the second participles may or may not overlap, that is, a given participle may be both a first participle and a second participle.
S123: Combine the first participle and the second participle to obtain the participle list. For example, if the first participles include "three-fused" and "HSS" and the second participles include "fused" and "growth", the participle list is "fused, three-fused, HSS, growth".
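The three sub-steps above can be sketched as follows. The sketch assumes a simple substring match against each library, which is a stand-in chosen here for illustration; the embodiment does not specify the matching algorithm. The function names, the sample description "three-fused HSS growth" and the library contents are hypothetical.

```python
from typing import List, Set, Tuple


def extract_participles(description: str, library: Set[str]) -> List[str]:
    """Return the library words found in the description, in order of position."""
    hits = [(description.find(word), word) for word in library if word in description]
    return [word for _, word in sorted(hits)]


def build_participle_list(description: str,
                          standard_vocab: Set[str],
                          conventional_vocab: Set[str]) -> Tuple[List[str], List[str], List[str]]:
    first = extract_participles(description, standard_vocab)       # S121
    second = extract_participles(description, conventional_vocab)  # S122
    # S123: combine both groups, dropping duplicates while keeping order.
    combined = list(dict.fromkeys(first + second))
    return first, second, combined


STANDARD = {"three-fused", "HSS", "HLR", "basic capacity"}
CONVENTIONAL = {"fused", "growth"}
first, second, participles = build_participle_list(
    "three-fused HSS growth", STANDARD, CONVENTIONAL)
print(first, second, participles)
```

The order of the combined list is an artifact of this sketch; only its membership corresponds to the example in S123.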
In S120, analyzing the participle list to obtain a target index name specifically includes S124, S125, S126 and S127.
S124: Search the standard vocabulary library for the standard index names at least partially corresponding to each participle in the participle list, and generate a participle set for each participle from those standard index names, wherein the elements of a participle set are standard index names in the standard vocabulary library.
Specifically, for each participle (first participle or second participle) in the participle list, the standard index names are traversed, and all standard index names at least partially corresponding to the participle are taken as the elements of that participle's participle set.
Since the first participles have corresponding standard vocabularies in the standard vocabulary library, each first participle has at least one at least partially corresponding standard index name in the standard vocabulary library. A second participle, however, may not overlap any first participle, that is, it may have no corresponding standard index name in the standard vocabulary library. The participle set of such a second participle is an empty set, so the participle sets of some participles may be empty.
S125: Take the intersection of the participle sets of all participles.
S126: If the intersection is a non-empty set, take the elements of the intersection as target index names. A non-empty intersection indicates that the participle sets of all participles share the same element, that is, the element is related to every participle, so the elements of the intersection are used as target index names for subsequently obtaining the index class information.
S127: If the intersection is an empty set, take the union of the participle sets of all participles, calculate the similarity between each element in the union and the participle list, and acquire a target element from the union according to the similarity results as the target index name. An empty intersection indicates that no element is shared by the participle sets of all participles. Therefore, the union of the participle sets is taken, the similarity between each element in the union and the participle list is calculated based on an NLP model, and the target index name is obtained according to the similarity results. Specifically, the similarity may be calculated based on a Word2Vec model.
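Steps S124 to S127 reduce to set operations, sketched below. "At least partially corresponding" is approximated here by a substring test, which is an assumption of this sketch; the function names and the sample index names are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def participle_sets(participles: Iterable[str],
                    standard_index_names: Set[str]) -> Dict[str, Set[str]]:
    """S124: map each participle to the standard index names that contain it."""
    return {p: {name for name in standard_index_names if p in name}
            for p in participles}


def candidate_names(participles: List[str],
                    standard_index_names: Set[str]) -> Tuple[Set[str], bool]:
    """Return (candidates, from_intersection) following S125 to S127."""
    sets = list(participle_sets(participles, standard_index_names).values())
    if not sets:
        return set(), False
    common = set.intersection(*sets)          # S125
    if common:
        return common, True                   # S126: shared by every participle
    return set().union(*sets), False          # S127: fall back to the union


NAMES = {"three-fused HSS/HLR basic capacity", "three-fused HSS/HLR number"}
print(candidate_names(["HSS", "basic capacity"], NAMES))
print(candidate_names(["three-fused", "HSS", "fused", "growth"], NAMES))
```

In the second call "growth" matches no index name, so its participle set is empty, the intersection is empty, and the union is returned for similarity ranking.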
In step S127, calculating the similarity between each element in the union and the participle list includes S1271, S1272 and S1273.
S1271: Segment each element in the union according to the standard vocabularies in the standard vocabulary library to obtain the element participles of each element. Specifically, the standard vocabularies in the standard vocabulary library do not include stop words, so the element participles of each element do not include stop words either. A stop word here refers to a vocabulary without specific meaning in the element, such as "number". For example, when the element "three-fused HSS/HLR number" in the union is segmented according to the standard vocabularies in the standard vocabulary library, the obtained element participles are "three-fused", "HSS" and "HLR".
S1272: Calculate the participle similarities in sequence, wherein a participle similarity is the similarity between an element participle of an element and a participle in the participle list. The second participles in the participle list may not be standard vocabularies in the standard vocabulary library, whereas the first participles are all standard vocabularies, so it suffices to calculate the similarity between each element participle and each first participle in the participle list. In other words, the similarity between a second participle and the element participles of each element can be regarded as 0. For example, the participle list is "fused, three-fused, HSS, growth", in which "three-fused" and "HSS" are first participles and "fused" and "growth" are second participles. To calculate the similarity between the element "three-fused HSS/HLR number" and this participle list, it is only necessary to calculate the similarity between each of "three-fused", "HSS" and "HLR" and each of "three-fused" and "HSS".
S1273: Calculate the average of all participle similarities of each element as the similarity between that element and the participle list. For example, the similarities between each of "three-fused", "HSS" and "HLR" and each of "three-fused" and "HSS" are averaged to obtain the similarity between the element "three-fused HSS/HLR number" and the participle list "fused, three-fused, HSS, growth".
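The averaging in S1272 and S1273 can be sketched as follows. The embodiment computes word similarity with a Word2Vec model; to keep the sketch self-contained, the word-level similarity is passed in as a callable, and the `exact_match` function used in the demonstration is a deliberately crude stand-in, not the model of the embodiment. All names are hypothetical.

```python
from itertools import product
from statistics import mean
from typing import Callable, Sequence


def element_similarity(element_participles: Sequence[str],
                       first_participles: Sequence[str],
                       word_similarity: Callable[[str, str], float]) -> float:
    """Average the pairwise similarities between an element's participles and
    the first participles of the participle list (S1272, S1273)."""
    pairs = list(product(element_participles, first_participles))
    if not pairs:
        return 0.0
    return mean(word_similarity(a, b) for a, b in pairs)


def exact_match(a: str, b: str) -> float:
    """Stand-in for a Word2Vec similarity: 1.0 for identical words, else 0.0."""
    return 1.0 if a == b else 0.0


# Element "three-fused HSS/HLR number" against the first participles of the list.
score = element_similarity(["three-fused", "HSS", "HLR"],
                           ["three-fused", "HSS"],
                           exact_match)
print(round(score, 3))
```

With the stand-in similarity, two of the six pairs match, so the score is 2/6. A trained embedding model would instead return graded values for related but non-identical terms.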
In step S127, acquiring the target element from the union according to the similarity results as the target index name includes S1274 or S1275.
S1274: Take the element with the highest similarity in the union as the target element to obtain the target index name. Generally, once the model has been trained to maturity, language identification is accurate enough that the element with the highest similarity is taken as the target element by default to obtain the target index name.
S1275: Take at least the two elements ranked highest by similarity in the union as at least two target elements to obtain at least two target index names, each target element corresponding to one target index name. At the early stage of model training, the accuracy of language identification may not be high, so the two or three elements with the highest similarity can be taken from the union as target elements. Two or three corresponding target index names are generated from these target elements, and two or three corresponding pieces of index class information are generated, that is, two or three SQL statements can be generated. The user can give evaluation feedback on each SQL statement, and the Word2Vec model can be iterated according to this feedback to improve its accuracy.
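The choice between S1274 and S1275 amounts to taking the top one or the top few elements by similarity, as sketched below. The function name, the parameter `k` and the similarity values are hypothetical and for illustration only.

```python
from typing import Dict, List


def select_targets(similarities: Dict[str, float], k: int = 1) -> List[str]:
    """Return the k elements with the highest similarity, best first.

    k = 1 corresponds to S1274; k = 2 or 3 corresponds to S1275.
    """
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Made-up similarity values for three candidate index names.
scores = {
    "three-fused HSS/HLR number": 0.62,
    "three-fused HSS/HLR basic capacity": 0.48,
    "HLR user count": 0.15,
}
print(select_targets(scores))        # mature model: single best element
print(select_targets(scores, k=2))   # early training: offer two candidates
```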
Each question and answer about an index class description is stored in the mapping library. If the user is satisfied with the answer, positive feedback is recorded in the mapping library; if the user is not satisfied, negative feedback is recorded. Therefore, in S1275, the mapping relationship between the at least two target elements and the corresponding target index names may also be stored in the mapping library. Specifically, the mapping library is established based on the Word2Vec model and can record the similarity relationship between the element participles of each element and the participle list.
Prior to S040, S021 and/or S022 and/or S023 and/or S024 are also included.
S021: Extract the temporal description of the sentence to obtain the time information. Specifically, the time information is obtained from the temporal description using a regular tree or a decision tree. The regular tree pre-stores temporal descriptions, and in the embodiment of the invention the temporal description can be matched along the three time dimensions of year, month and day. According to the temporal description of the sentence, the year is matched first, then the next layer of nodes of the regular tree is entered to match the month, and finally the day is matched to obtain the time information. For example, if the temporal description in the sentence is "the first Monday of March last year", the time information obtained after extraction and analysis with the regular tree is "March 5, 2018". FIG. 3 is a diagram illustrating the extraction of time information with the regular tree according to an embodiment of the present invention.
Of course, the temporal description may also be matched along at least any two of the year, month, day, time of day, and ordinal dimensions.
The temporal descriptions pre-stored in the regular tree include, but are not limited to, the following: last year, the year before last, the last two years, last month, July, August, last week (including its colloquial form), yesterday, the day before yesterday, Monday, Tuesday, the 4th, the 5th, the first, the second, and so forth.
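The layered year, month, day matching can be sketched with regular expressions as follows. This is a simplified illustration that works on English phrases, not the regular tree of the embodiment: the function names, the small set of supported patterns, the explicit reference date `today`, and the default of day 1 when no day is stated are all assumptions of this sketch.

```python
import re
from datetime import date

MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]
WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday", "friday",
            "saturday", "sunday"]


def match_year(text: str, today: date) -> int:
    """Layer 1: relative or explicit year, defaulting to the current year."""
    if "year before last" in text:
        return today.year - 2
    if "last year" in text:
        return today.year - 1
    found = re.search(r"\b(\d{4})\b", text)
    return int(found.group(1)) if found else today.year


def match_month(text: str, today: date) -> int:
    """Layer 2: month name, defaulting to the current month."""
    found = re.search(r"\b(" + "|".join(MONTHS) + r")\b", text)
    return MONTHS.index(found.group(1)) + 1 if found else today.month


def match_day(text: str, year: int, month: int) -> int:
    """Layer 3: 'first <weekday>' or an ordinal day; defaults to day 1."""
    found = re.search(r"\bfirst (" + "|".join(WEEKDAYS) + r")\b", text)
    if found:
        first_of_month = date(year, month, 1)
        target = WEEKDAYS.index(found.group(1))
        return 1 + (target - first_of_month.weekday()) % 7
    found = re.search(r"\b(\d{1,2})(?:st|nd|rd|th)\b", text)
    return int(found.group(1)) if found else 1


def parse_time(description: str, today: date) -> date:
    text = description.lower()
    year = match_year(text, today)
    month = match_month(text, today)
    return date(year, month, match_day(text, year, month))


print(parse_time("the first Monday of March last year", date(2019, 12, 31)))
```

With a reference date in 2019, the phrase resolves to March 5, 2018, which matches the example given for S021.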
S022: Extract the geographic location description of the sentence to obtain geographic location information. In the embodiment of the invention, the geographic location information can be obtained based on a regular tree or a decision tree. The geographic location descriptions pre-stored in the regular tree include the names of provinces, the names and aliases of cities, and common descriptions.
S023: Extract the data operational description of the sentence to obtain data operation information. The data operation information may be obtained based on a regular tree or a decision tree. The data operational vocabulary pre-stored in the regular tree includes, but is not limited to, the following: period-over-period comparison, year-over-year comparison, increase, decrease, amount of increase or decrease, and the like.
S024: Remove the stop words of the sentence. Stop words are words that carry no meaning in the sentence and can be removed to improve efficiency. Specifically, based on a stop-word library, it may be determined whether any vocabulary in the sentence corresponds to a vocabulary in the stop-word library; if so, that vocabulary in the sentence is determined to be a stop word and is removed.
In practice, if there is no temporal description in the sentence, the above S021 is not performed, and the latest time or the current time may be directly used as the time information. If the sentence has no geographical location description, not performing the step S022; if there is no data operation description in the statement, the above S023 is not performed.
In addition, if the sentence contains a temporal description, a geographic location description, a data operational description and an index class description, they are extracted in that order. Specifically, after the temporal description is extracted, the remainder of the sentence no longer contains the temporal description (this remainder is called the sentence from which geographic location information is to be extracted). After the geographic location description is extracted from that remainder, what is left contains neither the temporal nor the geographic location description (this is called the sentence from which data operation information is to be extracted), and so on. Each extraction therefore works on a shorter text, which improves the extraction efficiency and makes the language identification more efficient.
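The ordered extraction can be sketched as a pipeline in which each stage consumes what it recognizes and hands the remainder to the next stage. The sketch uses plain phrase lists in place of the regular tree or decision tree; the function names, phrase lists and sample sentence are hypothetical.

```python
import re
from typing import Dict, Iterable, List, Set, Tuple


def extract_phrases(text: str, phrases: Iterable[str]) -> Tuple[List[str], str]:
    """Remove every known phrase found in the text; return (found, remainder)."""
    found = []
    for phrase in sorted(phrases, key=len, reverse=True):  # longest first
        pattern = r"\b" + re.escape(phrase) + r"\b"
        if re.search(pattern, text):
            found.append(phrase)
            text = re.sub(pattern, " ", text)
    return found, " ".join(text.split())


def split_sentence(sentence: str, temporal: Set[str], places: Set[str],
                   operations: Set[str], stop_words: Set[str]) -> Dict[str, object]:
    time_info, rest = extract_phrases(sentence, temporal)   # S021
    geo_info, rest = extract_phrases(rest, places)          # S022
    op_info, rest = extract_phrases(rest, operations)       # S023
    # S024: drop stop words; what remains is the index class description (S040).
    index_description = " ".join(
        word for word in rest.split() if word.lower() not in stop_words)
    return {"time": time_info, "place": geo_info,
            "operation": op_info, "index": index_description}


result = split_sentence(
    "what is the year-over-year three-fused HSS/HLR basic capacity of Guiyang last month",
    temporal={"last month", "last year"},
    places={"Guiyang", "Guizhou"},
    operations={"year-over-year", "increase"},
    stop_words={"what", "is", "the", "of"},
)
print(result)
```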
After steps S080 and S100, S120 is also included.
S120: Identify the sentence semantics according to the index class information, the time information and/or the geographic location information and/or the data operation information. Specifically, an SQL statement is generated according to at least two of the index class information, the time information, the geographic location information and the data operation information.
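One way the recognized pieces might be assembled into an SQL statement is sketched below. The table name `index_data`, the column names `index_name`, `stat_date`, `region` and `value`, and the function `build_query` are hypothetical; the embodiment does not disclose a schema. The sketch uses placeholders so that recognized text is passed as parameters rather than spliced into the SQL string.

```python
from typing import List, Optional, Tuple


def build_query(index_name: str,
                time_info: Optional[str] = None,
                geo_info: Optional[str] = None) -> Tuple[str, List[str]]:
    """Build a parameterized SELECT from whichever pieces were recognized."""
    conditions = ["index_name = ?"]
    params = [index_name]
    if time_info is not None:
        conditions.append("stat_date = ?")
        params.append(time_info)
    if geo_info is not None:
        conditions.append("region = ?")
        params.append(geo_info)
    sql = "SELECT value FROM index_data WHERE " + " AND ".join(conditions)
    return sql, params


sql, params = build_query("three-fused HSS/HLR basic capacity",
                          time_info="2018-03-05", geo_info="Guiyang")
print(sql)
print(params)
```

Data operation information such as a year-over-year comparison would require a second query or an aggregate over two periods, which this sketch leaves out.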
Prior to step S040, the method further includes step S020: an input sentence to be recognized is received. The input sentence to be recognized may be a question sentence or, of course, a statement sentence.
Correspondingly, S120 identifies the semantics of the input sentence, i.e., identifies the semantics of the input question sentence or statement sentence.
Steps S122, S124 and S126 follow step S120.
S122: Output corresponding answer data according to the semantics of the input sentence. A data result corresponding to the sentence is queried according to the sentence semantics and converted into answer data for output.
S124: Receive feedback information indicating whether the user is satisfied with the answer data. If the user is satisfied, the language identification was correct and positive feedback information is received; if the user is not satisfied, the language identification was wrong and negative feedback information is received.
S126: Update the mapping relation between the input sentence to be recognized and the answer data in a mapping library according to the feedback information. The feedback information is either positive or negative. If positive feedback is received, the output answer data correctly corresponds to the input sentence; this is recorded in the mapping library, so that the answer data can be retrieved from the mapping library and output directly the next time the same input sentence is received. If negative feedback is received, the output answer data wrongly corresponds to the input sentence; this is also recorded in the mapping library, so that the same answer data is not output the next time the same input sentence is received.
Because the mapping library has been established, a step S030 may be inserted between steps S020 and S040: judge whether the mapping library holds correct answer data for the input sentence to be recognized. If so, output that answer data directly and end the language identification. If not, further judge whether the mapping library holds wrong answer data for the sentence; if it does, the words related to the wrong answer data are excluded, and otherwise the process proceeds to S040.
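The interplay of S030 and S126 can be sketched with a small in-memory mapping library. This is a simplified illustration under stated assumptions: the class `MappingLibrary`, the function `answer` and the `recognize` callable are hypothetical names, and the sketch excludes whole wrong answers rather than the individual words related to them.

```python
class MappingLibrary:
    """In-memory stand-in for the mapping library (an assumption of this sketch)."""

    def __init__(self):
        self._correct = {}  # sentence -> answer confirmed by positive feedback
        self._wrong = {}    # sentence -> answers rejected by negative feedback

    def record_feedback(self, sentence, answer_data, satisfied):
        """S126: store the sentence/answer relation according to the feedback."""
        if satisfied:
            self._correct[sentence] = answer_data
        else:
            self._wrong.setdefault(sentence, set()).add(answer_data)

    def lookup(self, sentence):
        """Return (correct answer or None, set of known wrong answers)."""
        return self._correct.get(sentence), set(self._wrong.get(sentence, ()))


def answer(sentence, library, recognize):
    """S030: consult the mapping library before running full recognition.

    `recognize` is a hypothetical callable returning candidate answers,
    best first; it stands in for steps S040 to S122.
    """
    cached, rejected = library.lookup(sentence)
    if cached is not None:
        return cached
    for candidate in recognize(sentence):
        if candidate not in rejected:
            return candidate
    return None
```

A positive record short-circuits recognition entirely, while a negative record only filters the candidates, which mirrors the two branches of S030.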
In embodiments of the present invention, the standard vocabulary library may be a subset of the mapping library.
The language identification method can be applied to a robot in the communications field. The robot suits a workplace scenario in which a worker asks the robot questions and the robot answers them, helping the worker query relevant information quickly. Of course, the method can also be applied in other technical fields, which are not described in detail here.
The above is the flow of the language identification method in the embodiment of the present invention. How the language identification is performed is described below with reference to a specific example.
When the input sentence is 'What is the ring ratio of the growth of the number of triple-fusion HSS in Zunyi city last month', the temporal description is extracted first to obtain the time information: 'last month' corresponds to March 2019. Next, the geographic location is matched to obtain 'Zunyi city'. The data operation description is then extracted to obtain 'ring ratio' (a period-over-period comparison). After these extractions, stop words such as 'the number of' are removed, leaving 'triple-fusion HSS growth', which is the index class description.
In 'triple-fusion HSS growth', 'triple-fusion' and 'HSS' are standard vocabulary in the standard vocabulary library and are therefore first participles, while 'fusion' and 'growth' are conventional vocabulary in the conventional vocabulary library and are therefore second participles. Combining the first participles and the second participles gives the participle list ['fusion', 'triple-fusion', 'HSS', 'growth'].
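One simple way to picture the combination of first and second participles is shown below. It is only a sketch: the function name `build_participle_list` and the use of substring search are assumptions, and the source does not state which segmentation algorithm is used. Note that the result may contain overlapping participles, as in the example above.

```python
def build_participle_list(description, standard_vocab, conventional_vocab):
    """Combine participles found via the standard and conventional libraries."""
    # First participles: words of the description that are standard vocabulary.
    first = [word for word in standard_vocab if word in description]
    # Second participles: words of the description that are conventional vocabulary.
    second = [
        word
        for word in conventional_vocab
        if word in description and word not in first
    ]
    return first + second
```

The order of the list differs from the order printed in the example, which is harmless here because the later set operations do not depend on order.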
Each of the participles 'fusion', 'triple-fusion', 'HSS' and 'growth' is then matched against the standard vocabulary library to obtain a participle set for each participle.
Set 1 is the participle set of 'fusion', specifically:
{ 'triple-fusion HSS/HLR number', 'triple-fusion HSS/HLR basic capacity', 'triple-fusion HSS/HLR VoLTE capacity' }
Set 2 is the participle set of 'triple-fusion', specifically:
{ 'triple-fusion HSS/HLR number', 'triple-fusion HSS/HLR basic capacity', 'triple-fusion HSS/HLR VoLTE capacity' }
Set 3 is the participle set of 'HSS', specifically:
{ 'triple-fusion HSS/HLR number', 'triple-fusion HSS/HLR VoLTE capacity', 'fixed network IMS HSS capacity', 'fixed network IMS HSS number', 'triple-fusion HSS/HLR basic capacity' }
Set 4 is the participle set of 'growth'. Because 'growth' is not a standard vocabulary in the standard vocabulary library, set 4 is an empty set.
The intersection of the four sets is empty, so the union of the sets is taken instead, giving a set of five elements.
The union is: { 'fixed network IMS HSS number', 'triple-fusion HSS/HLR VoLTE capacity', 'fixed network IMS HSS capacity', 'triple-fusion HSS/HLR number', 'triple-fusion HSS/HLR basic capacity' }
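The intersection-then-union step can be reproduced with Python sets. In this sketch, "at least partially corresponding" is approximated by a substring test, which is an assumption, and `candidate_index_names` is a hypothetical helper name.

```python
def candidate_index_names(participles, standard_names):
    """Return (candidate names, True if they came from a non-empty intersection)."""
    # One participle set per participle: standard names that contain the participle.
    sets = [
        {name for name in standard_names if participle in name}
        for participle in participles
    ]
    common = set.intersection(*sets) if sets else set()
    if common:
        return common, True
    # Empty intersection: fall back to the union of all participle sets.
    return set().union(*sets), False


STANDARD_NAMES = [
    "triple-fusion HSS/HLR number",
    "triple-fusion HSS/HLR basic capacity",
    "triple-fusion HSS/HLR VoLTE capacity",
    "fixed network IMS HSS capacity",
    "fixed network IMS HSS number",
]
```

With the participle list ['fusion', 'triple-fusion', 'HSS', 'growth'], the empty set for 'growth' forces the intersection to be empty, so the function falls back to the union of five names.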
The similarity between each element in the union and each participle in the participle list ['fusion', 'triple-fusion', 'HSS', 'growth'] is then calculated, with the following results. Because 'fusion' and 'growth' are not in the standard vocabulary library, that is, not in the training vocabulary, their similarities cannot be computed in the actual calculation. The similarity results between each element and the participles are shown in Tables 1 to 5.
TABLE 1 (image BDA0002349238180000131)
TABLE 2 (image BDA0002349238180000141)
TABLE 3 (image BDA0002349238180000142)
TABLE 4 (image BDA0002349238180000151)
TABLE 5 (images BDA0002349238180000152 and BDA0002349238180000161)
Tables 1 to 5 show that 'triple-fusion HSS/HLR number' is the standard index name with the highest similarity, so it is taken as the target index name. Looking back at the original question, 'What is the ring ratio of the growth of the number of triple-fusion HSS in Zunyi city last month', the question mainly asks about the index 'triple-fusion HSS number', so the identified semantics match the subject of the question.
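The averaging and selection described above can be sketched as follows. The word-level similarity function is not specified in this text, so `word_sim` is a caller-supplied, hypothetical function. The names `element_similarity` and `best_element` are also assumptions, and the sketch does not attempt to reproduce the values in Tables 1 to 5.

```python
from statistics import mean


def element_similarity(element_words, participles, word_sim, vocabulary):
    """Average the word-level similarities between one element and the participle list.

    Participles outside `vocabulary` are skipped, because no similarity
    can be computed for words that are not in the training vocabulary.
    """
    scores = [
        word_sim(element_word, participle)
        for element_word in element_words
        for participle in participles
        if participle in vocabulary
    ]
    return mean(scores) if scores else 0.0


def best_element(segmented_elements, participles, word_sim, vocabulary):
    """Return (element with the highest average similarity, all scores)."""
    scores = {
        name: element_similarity(words, participles, word_sim, vocabulary)
        for name, words in segmented_elements.items()
    }
    return max(scores, key=scores.get), scores
```

Here `segmented_elements` maps each element of the union to its element participles; returning all scores also makes it easy to take the top two or more elements instead of only the best one.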
FIG. 4 is a block diagram of a language identification system in accordance with yet another exemplary embodiment of the present invention. The language identification system 200 includes a data receiver 210 and a server 220.
The data receiver 210 receives data to be inspected. The server 220 includes a first port for receiving an input sentence to be recognized, a memory, and a processor for analyzing the input sentence to be recognized and outputting corresponding answer data. The memory stores the input sentence to be recognized and the corresponding answer data.
The language identification system provided in this embodiment may further execute the method shown in FIG. 1 or FIG. 2 and implement the functions of the embodiment shown in FIG. 1 or FIG. 2, which are not described again here.
An embodiment of the present invention further provides an apparatus, including: a memory, a processor and a computer program stored on said memory and executable on said processor, the computer program realizing the steps of the above-mentioned language identification method when executed by said processor.
An embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements each process of the language identification method embodiment and achieves the same technical effect; to avoid repetition, the details are not described here again. The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention, and are not intended to limit the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method of language identification, comprising:
acquiring index class description of a statement;
judging whether the index class description has a completely corresponding standard index name in a standard vocabulary library or not;
if yes, the standard index name is used as index class information;
if not, performing word segmentation on the index class description by combining the standard vocabulary library and the conventional vocabulary library to obtain a word segmentation list, analyzing the word segmentation list to obtain a target index name, and taking the target index name as the index class information.
2. The method according to claim 1, wherein performing word segmentation on the index class description by combining the standard vocabulary library and the conventional vocabulary library to obtain a word segmentation list comprises:
extracting a first participle in the index class description, wherein the first participle has a corresponding standard vocabulary in the standard vocabulary library;
extracting a second participle in the index class description, wherein the second participle has a corresponding conventional vocabulary in the conventional vocabulary library;
and combining the first participle and the second participle to obtain the word segmentation list.
3. The method according to claim 1, wherein analyzing the word segmentation list to obtain a target index name comprises:
searching the standard vocabulary library for standard index names that at least partially correspond to each participle in the word segmentation list, and generating a participle set for each participle from the standard index names that at least partially correspond to it, wherein the elements of each participle set are standard index names in the standard vocabulary library;
taking the intersection of the participle sets of all participles;
if the intersection is a non-empty set, taking the elements of the intersection as target index names;
and if the intersection is an empty set, taking the union of the participle sets of all participles, calculating the similarity between each element in the union and the word segmentation list, and acquiring a target element from the union according to the calculated similarity as the target index name.
4. The method according to claim 3, wherein calculating the similarity between each element in the union and the word segmentation list comprises:
segmenting each element in the union according to the standard vocabulary in the standard vocabulary library to obtain the element participles of each element;
calculating participle similarities in sequence, wherein a participle similarity is the similarity between an element participle of an element and a participle in the word segmentation list;
and calculating the average of all participle similarities of each element as the similarity between that element and the word segmentation list.
5. The language identification method according to claim 3, wherein acquiring the target element from the union as the target index name according to the calculated similarity comprises:
taking the element with the highest similarity in the union as the target element to obtain a target index name; or
taking at least the two elements with the highest similarity in the union as at least two target elements to obtain at least two target index names.
6. The language identification method according to claim 1, further comprising, before acquiring the index class description of the sentence:
extracting the temporal description of the sentence to obtain time information; and/or
extracting the geographic location description of the sentence to obtain geographic location information; and/or
extracting the data operation description of the sentence to obtain data operation information; and/or
removing stop words from the sentence;
after the obtaining of the index class information, the method further includes:
identifying the sentence semantics according to the index class information and the time information and/or the geographic location information and/or the data operation information.
7. The language identification method according to claim 6, further comprising, before acquiring the index class description of the sentence: receiving an input sentence to be recognized;
wherein identifying the sentence semantics according to the index class information and the time information and/or the geographic location information and/or the data operation information comprises: identifying the semantics of the input sentence according to the index class information and the time information and/or the geographic location information and/or the data operation information;
and wherein, after the sentence semantics are identified according to the index class information and the time information and/or the geographic location information and/or the data operation information, the method further comprises:
outputting corresponding answer data according to the semantics of the input sentence;
receiving feedback information indicating whether the user is satisfied with the answer data;
and updating the mapping relation between the input sentence to be recognized and the answer data in a mapping library according to the feedback information.
8. A language identification system, comprising a data receiver and a server; the data receiver collects the language to be recognized;
the server comprises a first port, a memory and a processor;
the first port is used for receiving an input sentence to be recognized;
the processor analyzes the input sentence to be recognized and outputs corresponding answer data;
the memory stores the input sentence to be recognized and the corresponding answer data.
9. An apparatus, comprising: memory, processor and computer program stored on the memory and executable on the processor, which computer program, when executed by the processor, carries out the steps of the method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911408163.2A CN113128216B (en) 2019-12-31 2019-12-31 Language identification method, system and device

Publications (2)

Publication Number Publication Date
CN113128216A true CN113128216A (en) 2021-07-16
CN113128216B CN113128216B (en) 2023-04-28

Family

ID=76770229

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911408163.2A Active CN113128216B (en) 2019-12-31 2019-12-31 Language identification method, system and device

Country Status (1)

Country Link
CN (1) CN113128216B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160217144A1 (en) * 2013-09-04 2016-07-28 Zte Corporation Method and device for obtaining web page category standards, and method and device for categorizing web page categories
CN108021553A (en) * 2017-09-30 2018-05-11 北京颐圣智能科技有限公司 Word treatment method, device and the computer equipment of disease term
CN108304411A (en) * 2017-01-13 2018-07-20 中国移动通信集团辽宁有限公司 The method for recognizing semantics and device of geographical location sentence
CN109213994A (en) * 2018-07-26 2019-01-15 深圳市元征科技股份有限公司 Information matching method and device
CN109670177A (en) * 2018-12-20 2019-04-23 翼健(上海)信息科技有限公司 One kind realizing the semantic normalized control method of medicine and control device based on LSTM
CN109800416A (en) * 2018-12-14 2019-05-24 天津大学 A kind of power equipment title recognition methods
CN110019418A (en) * 2018-01-02 2019-07-16 中国移动通信有限公司研究院 Object factory method and device, mark system, electronic equipment and storage medium
CN110222709A (en) * 2019-04-29 2019-09-10 上海暖哇科技有限公司 A kind of multi-tag intelligence marking method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ROBERT LEAMAN 等: "Challenges in clinical natural language processing for automated disorder normalization", 《JOURNAL OF BIOMEDICAL INFORMATICS》 *
王琴: "一种实用智能答疑系统的设计与实现", 《计算机与现代化》 *

Also Published As

Publication number Publication date
CN113128216B (en) 2023-04-28

Similar Documents

Publication Publication Date Title
CN108804521B (en) Knowledge graph-based question-answering method and agricultural encyclopedia question-answering system
CN106033416B (en) Character string processing method and device
CN109582799B (en) Method and device for determining knowledge sample data set and electronic equipment
US7281001B2 (en) Data quality system
CN111026886B (en) Multi-round dialogue processing method for professional scene
CN108182207B (en) Intelligent coding method and system for Chinese surgical operation based on word segmentation network
CN105487663A (en) Intelligent robot oriented intention identification method and system
CN103593412B (en) A kind of answer method and system based on tree structure problem
CN105138507A (en) Pattern self-learning based Chinese open relationship extraction method
CN110096581B (en) System and method for establishing question-answer system recommendation questions based on user behaviors
CN112035599B (en) Query method and device based on vertical search, computer equipment and storage medium
CN109977398A (en) A kind of speech recognition text error correction method of specific area
CN109918664B (en) Word segmentation method and device
CN109657063A (en) A kind of processing method and storage medium of magnanimity environment-protection artificial reported event data
CN114003709A (en) Intelligent question-answering system and method based on question matching
US10380065B2 (en) Method for establishing a digitized interpretation base of dongba classic ancient books
CN112445894A (en) Business intelligent system based on artificial intelligence and analysis method thereof
CN113064980A (en) Intelligent question and answer method and device, computer equipment and storage medium
CN110782892A (en) Voice text error correction method
CN110717021A (en) Input text and related device for obtaining artificial intelligence interview
CN112287082A (en) Data processing method, device, equipment and storage medium combining RPA and AI
CN113157875A (en) Knowledge graph question-answering system, method and device
CN113128216A (en) Language identification method, system and device
CN113111157B (en) Question-answer processing method, device, computer equipment and storage medium
CN113254668B (en) Knowledge graph construction method and system based on scene latitude

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant