CN114611504A - User speech and risk user identification method and related device - Google Patents

User speech and risk user identification method and related device

Info

Publication number
CN114611504A
Authority
CN
China
Prior art keywords
user
information
speech
vocabulary
risk
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210209634.2A
Other languages
Chinese (zh)
Inventor
梁嘉迪
谢吉松
上官亚力
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd filed Critical Netease Hangzhou Network Co Ltd
Priority to CN202210209634.2A priority Critical patent/CN114611504A/en
Publication of CN114611504A publication Critical patent/CN114611504A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computing Systems (AREA)
  • Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Economics (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The method for recognizing user speech first obtains the vocabulary information in the user speech to be recognized and the pinyin information corresponding to that vocabulary information, then fuses the vocabulary information and the pinyin information to obtain comprehensive feature information, and finally inputs the fused comprehensive feature information into a trained user-speech model to recognize the user speech. In this way, user speech is recognized efficiently, and improper user speech is identified accurately.

Description

User speech and risk user identification method and related device
Technical Field
The present application relates to the field of social networking technology, and in particular to methods and related devices for identifying user speech and risky users.
Background
Currently, many network software have social functions, such as social functions in network games.
In the prior art, bad-faith users who exploit social functions to commit violations are handled mainly by means such as reporting systems, text keyword matching, and manual review. However, as the user base grows massively, these methods become inefficient and incomplete.
Disclosure of Invention
In view of the above technical problems, an improved method is needed that can identify risky users and abnormal user speech efficiently and comprehensively.
The exemplary embodiment of the present application provides a method for recognizing a user utterance, including:
acquiring vocabulary information in a user speech to be recognized;
determining pinyin information corresponding to the vocabulary information based on the vocabulary information;
fusing the vocabulary information and the pinyin information to obtain comprehensive characteristic information;
and inputting the comprehensive characteristic information into a user speech model obtained by training, and identifying the user speech.
In some exemplary embodiments, the process of training the user speech model includes:
acquiring sample vocabulary information in a sample utterance to be input, and replacing some of the words in the sample vocabulary information;
acquiring sample pinyin information corresponding to the replaced sample vocabulary information;
and fusing the replaced sample vocabulary information and the sample pinyin information, and inputting the fused sample vocabulary information and the sample pinyin information into an initial user speech model to train the user speech model.
In some exemplary embodiments, fusing the vocabulary information and the pinyin information specifically includes:
converting the vocabulary information into a first word vector, and converting the pinyin information into a second word vector;
and splicing the first word vector and the second word vector together to realize the fusion of the vocabulary information and the pinyin information.
Based on the same inventive concept, the exemplary embodiment of the present application further provides a method for identifying a risky user, including:
recognizing the speech of the current user in the first preset time by the recognition method of the speech of the user, and determining the first number of abnormal speech of the current user in the first preset time based on the recognition result;
acquiring a second number of private chat users of the current user within a first preset time and a third number of private chat risk users of the current user within the first preset time;
and inputting the first quantity, the second quantity and the third quantity into a risk user identification model obtained by training so as to identify whether the current user is a risk user.
In some exemplary embodiments, the inputting the first number, the second number, and the third number into a risk user identification model obtained by training to identify whether the current user is a risk user includes:
obtaining the punished times of the current user within a second preset time;
inputting the punished times, the first quantity, the second quantity and the third quantity into a risk user identification model obtained by training so as to identify whether the current user is a risk user.
In some exemplary embodiments, the inputting the first number, the second number, and the third number into a risk user identification model obtained by training to identify whether the current user is a risk user includes:
in response to determining that the first number is 0, determining the current user as a non-risky user;
in response to determining that the first number is greater than or equal to 1, inputting the first number, the second number and the third number into a risk user identification model obtained by training to identify whether the current user is a risk user.
In some exemplary embodiments, recognizing the speech of the current user within the first preset time by the recognition method of the speech of the user as described above, and determining the first number of abnormal speech of the current user within the first preset time based on the recognition result includes:
for each language of the current user in a first preset time, identifying the language by the above-mentioned identification method of the user language, determining whether the language is an abnormal language based on the identification result of the language, determining whether the language comprises preset keywords in response to determining that the language is not the abnormal language, and determining that the language is the abnormal language in response to determining that the language comprises the preset keywords;
counting the number of all the abnormal speeches in a first preset time to determine the first number.
Based on the same inventive concept, the exemplary embodiments of the present application further provide an apparatus for recognizing a user utterance, including:
the vocabulary acquisition module is used for acquiring vocabulary information in the user speech to be recognized;
the pinyin determining module is used for determining pinyin information corresponding to the vocabulary information based on the vocabulary information;
the fusion module is used for fusing the vocabulary information and the pinyin information to obtain comprehensive characteristic information;
and the speech recognition module is used for inputting the comprehensive characteristic information into a user speech model obtained through training and recognizing the user speech.
Based on the same inventive concept, the exemplary embodiments of the present application further provide an apparatus for identifying a risky user, including:
the determining module is used for identifying the speech of the current user in the first preset time by the user speech identifying method, and determining the first number of abnormal speech of the current user in the first preset time based on the identification result;
the quantity acquisition module is used for acquiring a second quantity of the current users in private chat within a first preset time and a third quantity of the current users in private chat risk within the first preset time;
and the user identification module is used for inputting the first quantity, the second quantity and the third quantity into a risk user identification model obtained by training so as to identify whether the current user is a risk user.
Based on the same inventive concept, the exemplary embodiments of this application also provide an electronic device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and the processor executes the program to implement the method for identifying a user utterance or the method for identifying a risky user as described above.
Based on the same inventive concept, exemplary embodiments of the present application also provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method for identifying a user utterance or the method for identifying a risky user as described above.
As can be seen from the above, the method for recognizing user speech provided in the embodiments of the present application obtains the vocabulary information in the user speech to be recognized and the pinyin information corresponding to that vocabulary information, fuses the two to obtain comprehensive feature information, and inputs the fused comprehensive feature information into a trained user-speech model to recognize the user speech. User speech is thereby recognized efficiently, and improper user speech is identified accurately.
Drawings
To illustrate the technical solutions in the present application or in the related art more clearly, the drawings needed for describing the embodiments or the related art are briefly introduced below. The drawings described below show only embodiments of the present application; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a method for recognizing a user utterance in an exemplary embodiment of the present application;
fig. 2 is a flowchart illustrating an identification method of a risky user in an exemplary embodiment of the present application;
FIG. 3 is a schematic flow chart of training a risky user recognition model in an exemplary embodiment of the present application;
fig. 4 is a schematic structural diagram of a recognition apparatus for user speech in an exemplary embodiment of the present application;
fig. 5 is a schematic structural diagram of an identification apparatus for a risky user in an exemplary embodiment of the present application;
fig. 6 is a schematic structural diagram of a specific electronic device in an exemplary embodiment of the present application.
Detailed Description
The principles and spirit of the present application will be described with reference to a number of exemplary embodiments. It should be understood that these embodiments are given solely for the purpose of enabling those skilled in the art to better understand and to practice the present application, and are not intended to limit the scope of the present application in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
According to the embodiments of the present application, methods and related devices for identifying user speech and risky users are provided.
In this document, it is to be understood that any number of elements in the figures are provided by way of illustration and not limitation, and any nomenclature is used for differentiation only and not in any limiting sense.
The principles and spirit of the present application are explained in detail below with reference to several representative embodiments thereof.
Summary of The Invention
In the prior art, risky bad-faith users who exploit social functions to commit violations are handled mainly by reporting systems, text keyword matching, manual review, and the like.
A reporting system relies on users reporting spontaneously, and reporting is an extra action for them, so unless targeted abusive speech is continuous, the number of bad users a reporting system can uncover is very limited. Waiting for user reports is also a quite passive way for an operator to discover bad users; it cannot strike proactively. In addition, minors have poor judgment, and many risky users are cunning and deceptive, in which case the reporting mechanism is often ineffective.
Keyword matching requires maintaining complex regular expressions, and writing them requires a large amount of manual screening. Because the regular expressions are preset, users can adopt new ways of expressing themselves to escape matching, so the expressions must be continuously maintained and optimized, which brings large labor costs for operation. Moreover, to evade sensitive-word filters, users with high social risk substitute large numbers of visually similar characters, homophones, English letters, and the like, such as 'Jia-me Wei Xin' (i.e., 'add my WeChat' written with substitute characters), making complete and accurate keyword matching difficult.
In the extreme case, a manual review platform needs human reviewers to assess every utterance; users are numerous, most of them are normal, and only a few high-social-risk users are mixed among them. Determining whether a user is at high social risk merely by manually screening messages therefore consumes a great deal of labor and time, and is inefficient. Moreover, because manual review takes place after an utterance has already been sent, it cannot react in time, so the qualification of socially risky users lags behind.
In order to solve the above problems, the present application provides a method for recognizing user speech, which specifically includes:
acquiring vocabulary information in the user speech to be recognized and the pinyin information corresponding to that vocabulary information, and fusing the two to obtain comprehensive feature information. The comprehensive feature information contains both the vocabulary features in the user speech and the pinyin features corresponding to them. It is then input as a whole into a trained user-speech model to recognize the user speech. In this way, the neural network model can jointly consider both the vocabulary and the pinyin features of the user speech when recognizing it, which improves recognition accuracy and prevents users from escaping abnormal-speech detection through look-alike characters, homophones, and similar substitutions. At the same time, the neural network replaces one-by-one manual review, further improving the efficiency of user speech recognition.
Having described the basic principles of the present application, various non-limiting embodiments of the present application are described in detail below.
Application scene overview
In some specific application scenarios, the method for identifying user speech or risky users in the present application may be applied directly within a system that has social functions, as a part of that system; optionally, the system with social functions may be a game system. The method may also be applied independently in an evaluation system: various data such as user speech is obtained from the system with social functions and input into the evaluation system, which then identifies user speech or risky users by this method.
In some specific application scenarios, the method for identifying user speech or risky users can run either locally or on a cloud server. When it runs on a cloud server, the acquired data to be processed is sent to the server over the network, the server processes the data using the identification method, and the processing results are sent back over the network.
The following describes the method for identifying user speech or risky users according to exemplary embodiments of the present application with reference to a specific application scenario. It should be noted that the above application scenarios are presented only to facilitate understanding of the spirit and principles of the present application, and the embodiments of the present application are not limited in this respect. Rather, the embodiments of the present application may be applied in any applicable scenario.
Exemplary method
Referring to fig. 1, an embodiment of the present application provides a method for recognizing user speech, including the following steps:
s101, obtaining vocabulary information in the user speech to be recognized and pinyin information corresponding to the vocabulary information.
In specific implementation, when recognizing user speech, the vocabulary information in the user speech to be recognized is obtained first. Optionally, user speech generally refers to the text messages entered by the user. Optionally, the vocabulary information may include one or more words. In some embodiments, after the vocabulary information in the user speech to be recognized is obtained, it may be cleaned to filter out invalid content.
In some exemplary embodiments, before determining pinyin information corresponding to the vocabulary information based on the vocabulary information, the method further includes:
determining whether the vocabulary information contains splittable words based on a preset vocabulary table;
in response to determining that the vocabulary information contains a splittable word, splitting the splittable word into a plurality of words, and re-acquiring the vocabulary information to be processed in the user speech based on the split words.
In specific implementation: in current online social interaction, some users use splittable words to express veiled meanings, for example using a character meaning 'ramming' to stand for 'strong force' (the character visually splits into the two characters of that phrase). To identify abnormal speech containing such splittable words, after the vocabulary information in the user speech is obtained, whether it contains a splittable word is determined against a preset vocabulary table; the table can be configured as needed and is not limited here. When the vocabulary information is determined to contain a splittable word, that word is split into its component words, the vocabulary information in the user speech is re-acquired from the split result, and the subsequent steps S102 to S104 are then performed again.
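A minimal sketch of this pre-processing step, assuming a dictionary-based split table; the table entry and the function name below are illustrative, not from the patent:

```python
# Sketch of the splittable-vocabulary pre-processing described above.
# SPLIT_TABLE is a hypothetical stand-in for the patent's "preset
# vocabulary table"; real entries would be curated by moderators.
SPLIT_TABLE = {
    "夯": ["大", "力"],  # a character meaning "ramming" that splits into "strong force"
}

def expand_splittable(tokens):
    """Replace every splittable token with its component tokens,
    leaving other tokens unchanged."""
    expanded = []
    for token in tokens:
        expanded.extend(SPLIT_TABLE.get(token, [token]))
    return expanded
```

After expansion, the re-acquired token list is fed back through steps S102 to S104 as described above.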
And S102, determining corresponding pinyin information based on the vocabulary information.
In specific implementation, after the vocabulary information in the user speech is acquired, the pinyin information corresponding to it is determined from the vocabulary information. For example, if certain vocabulary information is 'I have come online', its corresponding pinyin information is 'wo shang xian le'. Optionally, when a polyphonic character is encountered, one pinyin reading from the dictionary is fixed as the pinyin information for that word, which ensures that the neural network model does not produce different recognition results during learning and recognition because different readings were chosen.
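This mapping step could be sketched as follows. The `PINYIN` table here is a toy stand-in for a full pinyin dictionary (a production system might use a library such as pypinyin); pinning each polyphonic character to a single reading is handled by the table itself:

```python
# Sketch of step S102: map each character to one fixed pinyin reading.
# PINYIN is a hypothetical toy table; "了" (le/liao) is polyphonic and
# is pinned to the single reading "le" here.
PINYIN = {"我": "wo", "上": "shang", "线": "xian", "了": "le"}

def to_pinyin(text):
    """Return space-separated pinyin, passing unknown characters through."""
    return " ".join(PINYIN.get(ch, ch) for ch in text)
```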
S103, fusing the vocabulary information and the pinyin information to obtain comprehensive characteristic information.
In specific implementation, after the vocabulary information and the pinyin information corresponding to the vocabulary information are obtained, the vocabulary information and the pinyin information corresponding to the vocabulary information are fused, and then comprehensive characteristic information is obtained, so that the comprehensive characteristic information comprises both the vocabulary characteristics and the pinyin characteristics corresponding to the vocabulary characteristics.
In some exemplary embodiments, fusing the vocabulary information and the pinyin information specifically includes:
converting the vocabulary information into a first word vector, and converting the pinyin information into a second word vector;
and splicing the first word vector and the second word vector together to realize the fusion of the vocabulary information and the pinyin information.
In specific implementation, when fusing the vocabulary information and the pinyin information, the vocabulary information is first converted into a first word vector and the pinyin information into a second word vector. The specific word-vector conversion process is not limited here; any existing means may be used, for example word2vec (a model for generating word vectors). After the first word vector and the second word vector are obtained, they are directly spliced together, which realizes the fusion of the vocabulary information and the pinyin information and yields the comprehensive feature information. Suppose certain vocabulary information, 'the weather is really nice today', corresponds to the first word vector [1,2,3,4,5,6], and the second word vector of its pinyin information is [7,8,9,10,11,12]; the comprehensive feature information obtained after splicing is then [1,2,3,4,5,6,7,8,9,10,11,12].
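The splicing described above amounts to simple vector concatenation, as the following sketch shows (using plain Python lists in place of real word2vec embeddings):

```python
def fuse(word_vec, pinyin_vec):
    """Splice the vocabulary word vector and the pinyin word vector into
    one comprehensive feature vector (plain concatenation)."""
    return word_vec + pinyin_vec

first = [1, 2, 3, 4, 5, 6]      # first word vector (from the text)
second = [7, 8, 9, 10, 11, 12]  # second word vector (from its pinyin)
combined = fuse(first, second)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
```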
It should be noted that the above way of fusing the vocabulary information and the pinyin information is only one specific implementation provided in the present application; those skilled in the art may fuse them by other means, for example by appending to each word in the vocabulary information its corresponding pinyin information.
And S104, inputting the comprehensive characteristic information into a trained user speech model, and identifying the user speech.
In specific implementation, after the comprehensive feature information is obtained, it is input into the trained user-speech model to recognize the user speech. Optionally, the recognition results of the user-speech model distinguish normal speech from abnormal speech, the latter including abusive speech, pornographic speech, political speech, and the like.
In some exemplary embodiments, the process of training the user speech model includes:
acquiring sample vocabulary information in a sample utterance to be input, and replacing some of the words in the sample vocabulary information;
acquiring sample pinyin information corresponding to the replaced sample vocabulary information;
and fusing the replaced sample vocabulary information and the sample pinyin information, and inputting the fused sample vocabulary information and the sample pinyin information into an initial user speech model to train the user speech model.
In specific implementation, to further improve the noise robustness of the user-speech model, some words of the sample vocabulary information to be input can be replaced before it is fed into the initial user-speech model during training. Optionally, the replacement may be performed randomly, or some sensitive words may be actively replaced with look-alike or homophonic characters. After replacement, the sample pinyin information corresponding to the replaced sample vocabulary information is obtained, the replaced sample vocabulary information and the sample pinyin information are fused, and the fused result is finally input as a whole into the initial user-speech model to train it.
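The replacement step might be sketched as follows, assuming a curated map of confusable characters; the `CONFUSABLE` entries below are hypothetical examples, not taken from the patent:

```python
import random

# Sketch of the training-time replacement step. CONFUSABLE maps a word
# to look-alike / homophonic substitutes (hypothetical examples).
CONFUSABLE = {"微": ["薇"], "信": ["芯"]}

def perturb(tokens, ratio=0.3, rng=None):
    """Randomly replace a fraction of replaceable tokens with a
    confusable variant; the rng is seeded by default for reproducibility."""
    rng = rng or random.Random(0)
    out = []
    for tok in tokens:
        if tok in CONFUSABLE and rng.random() < ratio:
            out.append(rng.choice(CONFUSABLE[tok]))
        else:
            out.append(tok)
    return out
```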
It should be noted that, the selection of the initial user speech model is not limited herein, and optionally, the initial user speech model may be a bert-base, and a person skilled in the art may also select any model from existing neural network models as an initial model according to needs.
In some exemplary embodiments, because the numbers of user utterances in the different classes recognized by the user-speech model differ, focal loss (a variant of cross-entropy loss) is employed as the loss function when training the user-speech model, thereby reducing the training impact caused by class imbalance.
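For reference, focal loss for a single example can be written as -(1-p)^γ·log p, where p is the predicted probability of the true class; γ = 0 recovers plain cross-entropy, and larger γ down-weights easy examples. A minimal sketch:

```python
import math

def focal_loss(p_true, gamma=2.0):
    """Focal loss for one example: -(1 - p)^gamma * log(p), where p is
    the predicted probability of the true class. gamma = 0 recovers
    plain cross-entropy; larger gamma down-weights easy examples."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)
```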
According to the method for recognizing user speech provided in this embodiment, the vocabulary information in the user speech to be recognized and its corresponding pinyin information are obtained first, the two are then fused into comprehensive feature information, and the fused comprehensive feature information is input into a trained user-speech model to recognize the user speech. User speech is thereby recognized efficiently, and improper user speech is identified accurately.
Referring to fig. 2, a flow chart of an identification method for a risky user in an exemplary embodiment of the present application is schematically illustrated, where the method includes the following steps:
s201, the speech of the current user in the first preset time is identified through the user speech identification method, and the first number of abnormal speech of the current user in the first preset time is determined based on the identification result.
In particular, a risky user in the present application refers to a user who frequently posts improper statements and may readily harm other users. When identifying risky users, the first number of each user's abnormal utterances within a first preset time must be determined; the first preset time can be set as needed, for example to 24 hours. To determine the first number, the current user's speech within the first preset time can be recognized by the user-speech recognition method described above, and the first number of the current user's abnormal utterances within that time determined from the recognition results. Optionally, a recognition result indicates that an utterance of the current user is normal speech or abnormal speech.
In some exemplary embodiments, when users are identified through the risky-user identification model, the acquired user data may be clustered and reorganized; for example, the utterances of each user within the preset time are grouped by user ID and represented in JSON form: [{"user A": ["saying one", "saying two"]}, {"user B": ["saying three", "saying four"]}].
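The clustering and reorganization described here is a straightforward group-by on user ID; a sketch follows (the record layout is assumed):

```python
import json
from collections import defaultdict

# Sketch of the clustering/reorganisation step: group raw
# (user_id, utterance) records by user ID and render them in the
# JSON form shown above.
def group_by_user(records):
    grouped = defaultdict(list)
    for user_id, utterance in records:
        grouped[user_id].append(utterance)
    return [{uid: sayings} for uid, sayings in grouped.items()]

records = [("user A", "say one"), ("user B", "say three"),
           ("user A", "say two"), ("user B", "say four")]
payload = json.dumps(group_by_user(records), ensure_ascii=False)
```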
In some exemplary embodiments, recognizing the speech of the current user within the first preset time by the recognition method of the speech of the user, and determining the first number of abnormal speech of the current user within the first preset time based on the recognition result includes:
for each utterance of the current user within the first preset time, recognizing the utterance by the above-described method for recognizing user speech, determining whether it is an abnormal utterance based on the recognition result, determining, in response to determining that it is not abnormal, whether it includes preset keywords, and determining that it is abnormal in response to determining that it includes the preset keywords;
counting the number of all the abnormal speeches in a first preset time to determine the first number.
In a specific implementation, for each utterance of each user within the first preset time, it is determined whether the utterance is abnormal speech according to the method for recognizing user speech. When the method determines that the utterance is abnormal speech, the result is output directly; when the method determines that the utterance is not abnormal speech, it is further determined whether the utterance comprises preset keywords, and when the utterance is determined to comprise the preset keywords, the utterance is determined to be abnormal speech. Only when the method determines that the utterance is not abnormal speech and the utterance does not include the preset keywords is the utterance determined not to be abnormal speech, which further improves the accuracy of user speech recognition and avoids missing the abnormal speech of some users. Optionally, the preset keywords may be set as needed and are not limited herein. After each utterance of the user within the first preset time has been recognized, the number of all abnormal utterances within the first preset time can be counted to determine the first number.
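The two-stage check and the counting step can be sketched as below. The model predicate and the keyword list are hypothetical stand-ins; the patent's trained model is not reproduced here.

```python
def is_abnormal(utterance, model_predict, preset_keywords):
    """Two-stage check described above: the model verdict first, then a
    keyword fallback for utterances the model passes as normal.
    `model_predict` is a stand-in for the trained user-speech model."""
    if model_predict(utterance):   # model already flags it as abnormal
        return True
    # model says normal -> fall back to the preset keyword list
    return any(kw in utterance for kw in preset_keywords)

def count_abnormal(utterances, model_predict, preset_keywords):
    """The "first number": abnormal utterances within the preset window."""
    return sum(is_abnormal(u, model_predict, preset_keywords) for u in utterances)

# toy stand-in model: flags utterances containing "scam"
model = lambda u: "scam" in u
keywords = ["cheat"]
print(count_abnormal(["hello", "a scam link", "cheat trade"], model, keywords))  # → 2
```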
S202, obtaining a second number of users the current user privately chats with within the first preset time and a third number of risk users the current user privately chats with within the first preset time.
In a specific implementation, after determining the first number of abnormal utterances of the current user within the first preset time, it is further necessary to obtain a second number of users the current user privately chats with within the first preset time and a third number of risk users the current user privately chats with within the first preset time. It should be noted that such a risk user may be one already identified by the method of the present application or one identified by another method, which is not limited herein.
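Deriving the second and third numbers from a chat log can be sketched as follows. The log format (sender/receiver pairs) is an assumption for illustration only.

```python
def private_chat_counts(chat_log, user, risk_users):
    """Derive the "second number" (distinct private-chat partners) and the
    "third number" (partners already flagged as risk users) for `user`
    from a log of (sender, receiver) private-chat pairs."""
    partners = set()
    for sender, receiver in chat_log:
        if sender == user:
            partners.add(receiver)
        elif receiver == user:
            partners.add(sender)
    second = len(partners)                      # distinct chat partners
    third = len(partners & set(risk_users))     # partners who are risk users
    return second, third

log = [("u1", "u2"), ("u1", "u3"), ("u4", "u1"), ("u2", "u3")]
print(private_chat_counts(log, "u1", risk_users=["u3", "u4"]))  # → (3, 2)
```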
S203, inputting the first quantity, the second quantity and the third quantity into a risk user identification model obtained by training so as to identify whether the current user is a risk user.
In a specific implementation, after the first number, the second number, and the third number are obtained, they may be input into a risk user identification model obtained by training to identify whether the current user is a risk user. Optionally, the output of the risk user identification model may be a definite category, such as risk user or non-risk user, or the model may output the probability that the user is a high social risk user, which is not limited herein. Optionally, any existing neural network model may be selected as the initial risk user identification model, which is not limited herein, and the final risk user identification model is obtained by training on a large amount of sample user data.
In some exemplary embodiments, when the risk user identification model is trained, stratified three-fold cross validation is used. As shown in fig. 3, the users are divided into three parts with the same proportion of risk users and non-risk users in each part; two of the three parts are used as the training set and the remaining part as the validation set, giving three combinations. The model is trained on each combination, and the average of the resulting model accuracies is taken as the training accuracy. Optionally, when training the risk user identification model, the parameters may be adjusted by the following loss function:
loss = -(1/m) · Σ_{i=1}^{m} [ α · y_i · log(f(x_i)) + β · (1 − y_i) · log(1 − f(x_i)) ], where f(x_i) denotes the model's predicted probability for input x_i
wherein loss represents the loss function of the risk user identification model, α and β represent class weights, m represents the number of training samples, x_i represents a model input, and y_i represents a category label. It should be noted that, during model training, the number of samples in the risk user category is small, resulting in unbalanced input sample categories; the class weights are therefore set in the loss function to prevent the unbalanced sample categories from degrading the model's identification accuracy.
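A class-weighted binary cross-entropy consistent with the symbol descriptions above can be sketched as follows. This is one common form under the stated assumptions (binary labels, α weighting the scarce risk-user class), not necessarily the exact formula of the patent's figure.

```python
import numpy as np

def weighted_bce_loss(y_pred, y_true, alpha, beta):
    """Class-weighted binary cross-entropy: alpha weights the (scarce)
    risk-user class, beta the non-risk class, averaged over m samples."""
    y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)   # numerical safety for log()
    m = len(y_true)
    return -np.sum(alpha * y_true * np.log(y_pred)
                   + beta * (1 - y_true) * np.log(1 - y_pred)) / m

y_true = np.array([1.0, 0.0, 0.0, 0.0])        # one risk user in four samples
y_pred = np.array([0.9, 0.1, 0.2, 0.1])
# up-weighting the rare positive class (alpha > beta)
print(weighted_bce_loss(y_pred, y_true, alpha=3.0, beta=1.0))
```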
In some exemplary embodiments, the inputting the first number, the second number, and the third number into a risk user identification model obtained by training to identify whether the current user is a risk user includes:
obtaining the punished times of the current user within a second preset time;
inputting the punished times and the first, second and third numbers into the risk user identification model to identify whether the current user is a risk user.
In a specific implementation, in order to further improve the accuracy with which the model identifies risk users, the number of times the current user was punished within a second preset time may be input into the risk user identification model together with the first number, the second number, and the third number. Optionally, the punishments include being muted (banned from chatting), having the account suspended, and other penalties, which are not limited herein. The second preset time may be set as needed and is not limited herein; for example, it may be set to one month.
In some exemplary embodiments, the inputting the first number, the second number, and the third number into a risk user identification model obtained by training to identify whether the current user is a risk user includes:
in response to determining that the first number is 0, determining the current user as a non-risky user;
in response to determining that the first number is greater than or equal to 1, inputting the first number, the second number and the third number into a risk user identification model obtained by training to identify whether the current user is a risk user.
In a specific implementation, considering that only users who have produced abnormal speech can be risk users, in order to further improve the efficiency of the risk user identification model, after determining the first number of abnormal utterances of the current user within the first preset time, the user is identified through the risk user identification model only when that first number is greater than or equal to 1.
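The pre-filtering step can be sketched as below. The feature layout and the toy threshold model are hypothetical; the trained neural network model is represented only by a callback.

```python
def identify_risk_users(user_features, model_predict):
    """Pre-filter described above: users with zero abnormal utterances are
    declared non-risk without invoking the model; only the rest are scored.
    `user_features` maps user_id -> (first, second, third) numbers;
    `model_predict` stands in for the trained risk-user model."""
    results = {}
    for user_id, (first, second, third) in user_features.items():
        if first == 0:
            results[user_id] = False          # non-risk, model skipped
        else:
            results[user_id] = model_predict(first, second, third)
    return results

features = {"u1": (0, 5, 0), "u2": (4, 9, 3)}
# toy threshold "model" standing in for the trained classifier
model = lambda first, second, third: first >= 3 and third >= 1
print(identify_risk_users(features, model))  # → {'u1': False, 'u2': True}
```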
In some exemplary embodiments, the method for identifying risk users provided by the present application may be run periodically in a system with a social function, so as to continuously identify risk users among new users and update the identification state of old users.
According to the present application, through big data processing and machine learning techniques, the social risk of users is evaluated proactively, users who may adversely affect the social environment are identified efficiently, and the problems of insufficient initiative and high manual auditing cost are alleviated. If a user's speech contains no abnormal speech, the user is determined to be a normal user and no model discrimination is needed; this operation filters out a large number of normal users, reduces the number of users the model must identify, and increases running speed. Meanwhile, through automatic task scheduling, users' social risk is measured far more frequently than is possible manually, greatly improving the response speed to bad behavior; the user risk degree measured by machine learning can even identify users who are about to misbehave, enabling preemptive action.
Exemplary device
Based on the same inventive concept, corresponding to the method of any embodiment, the application also provides a device for recognizing the user speech.
Referring to fig. 4, the apparatus for recognizing a user utterance includes:
the vocabulary acquisition module 301 is used for acquiring vocabulary information in the user speech to be recognized;
the pinyin determining module 302 is used for determining pinyin information corresponding to the vocabulary information based on the vocabulary information;
the fusion module 303 is used for fusing the vocabulary information and the pinyin information to obtain comprehensive characteristic information;
and the speech recognition module 304 is used for inputting the comprehensive characteristic information into a trained user speech model to recognize the user speech.
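The cooperation of the four modules above can be sketched as a pipeline. The per-character tokenization, pinyin lookup table, toy "embedding", and threshold model are all hypothetical stand-ins for the components the patent leaves unspecified.

```python
def recognize(utterance, to_pinyin, vectorize, model_predict):
    """End-to-end sketch of the four modules above: vocabulary acquisition,
    pinyin determination, feature fusion by concatenation, and model
    inference."""
    tokens = list(utterance)                         # module 301: vocabulary info (per-character here)
    pinyin = [to_pinyin.get(t, t) for t in tokens]   # module 302: pinyin info
    fused = vectorize(tokens) + vectorize(pinyin)    # module 303: concatenated feature vectors
    return model_predict(fused)                      # module 304: recognition result

pinyin_table = {"你": "ni", "好": "hao"}             # hypothetical lookup table
vectorize = lambda seq: [len(s) for s in seq]        # toy stand-in "embedding"
model = lambda feats: "abnormal" if sum(feats) > 10 else "normal"
print(recognize("你好", pinyin_table, vectorize, model))  # → normal
```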
For convenience of description, the above devices are described as being divided into various modules by functions, and are described separately. Of course, the functionality of the various modules may be implemented in the same one or more software and/or hardware implementations of the present disclosure.
The apparatus of the foregoing embodiment is used to implement the corresponding method for recognizing the user speech in any embodiment of the foregoing exemplary method portions, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Based on the same inventive concept, corresponding to the method of any embodiment, the application also provides a device for identifying the risk user.
Referring to fig. 5, the identification means of the risky user includes:
the determining module 401 is configured to identify the speech of the current user within the first preset time by using the user speech identification method, and determine, based on the identification result, a first number of abnormal speech of the current user within the first preset time;
the quantity obtaining module 402 is configured to obtain a second number of users the current user privately chats with within the first preset time, and a third number of risk users the current user privately chats with within the first preset time;
the user identification module 403 is configured to input the first quantity, the second quantity, and the third quantity into a risk user identification model obtained through training, so as to identify whether the current user is a risk user.
For convenience of description, the above devices are described as being divided into various modules by functions, and are described separately. Of course, the functionality of the various modules may be implemented in the same one or more software and/or hardware implementations of the present disclosure.
The apparatus of the foregoing embodiment is used to implement the method for identifying a risky user in any embodiment of the foregoing exemplary method portions, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Based on the same inventive concept, corresponding to the method of any of the above embodiments, the present application further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the method for recognizing the user utterance or the method for recognizing the risky user according to any of the above embodiments when executing the program.
Fig. 6 is a schematic diagram illustrating a more specific hardware structure of an electronic device according to this embodiment, where the electronic device may include: a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. Wherein the processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 are communicatively coupled to each other within the device via bus 1050.
The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present disclosure.
The Memory 1020 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs, and when the technical solution provided by the embodiments of the present specification is implemented by software or firmware, the relevant program codes are stored in the memory 1020 and called to be executed by the processor 1010.
The input/output interface 1030 is used for connecting an input/output module to input and output information. The i/o module may be configured as a component in a device (not shown) or may be external to the device to provide a corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The communication interface 1040 is used for connecting a communication module (not shown in the drawings) to implement communication interaction between the present apparatus and other apparatuses. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, Bluetooth and the like).
The bus 1050 includes multiple paths that carry information between various components of the device, such as the processor 1010, the memory 1020, the input/output interface 1030, and the communication interface 1040.
It should be noted that although the above-mentioned device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050, in a specific implementation, the device may also include other components necessary for normal operation. In addition, those skilled in the art will appreciate that the above-described apparatus may also include only those components necessary to implement the embodiments of the present description, and not necessarily all of the components shown in the figures.
The electronic device of the above embodiment is used to implement the method for recognizing the speech of the user in any embodiment of the foregoing exemplary method or the method for recognizing the risky user in any embodiment of the foregoing exemplary method. And has the beneficial effects of the corresponding method embodiments, which are not described herein again.
Exemplary program product
Based on the same inventive concept, corresponding to any of the above-mentioned embodiment methods, the present application further provides a non-transitory computer-readable storage medium storing computer instructions for causing the computer to execute the method for identifying a user utterance or the method for identifying a risky user according to any of the above-mentioned embodiments.
The non-transitory computer readable storage medium may be any available medium or data storage device that can be accessed by a computer, including but not limited to magnetic memory (e.g., floppy disks, hard disks, magnetic tape, magneto-optical disks (MOs), etc.), optical memory (e.g., CDs, DVDs, BDs, HVDs, etc.), and semiconductor memory (e.g., ROMs, EPROMs, EEPROMs, non-volatile memory (NAND FLASH), Solid State Disks (SSDs)), etc.
The storage medium of the above embodiment stores computer instructions for causing the computer to perform the method for identifying a user utterance or the method for identifying a risky user according to any of the above embodiments. And has the beneficial effects of the corresponding method embodiments, which are not described herein again.
As will be appreciated by one skilled in the art, embodiments of the present invention may be embodied as a system, method, or computer program product. Thus, the present application may take the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.), or a combination of hardware and software, which may be referred to herein generally as a "circuit," "module," or "system." Furthermore, in some exemplary embodiments, the present invention may also be embodied as a computer program product in one or more computer-readable media having computer-readable program code embodied therein.
Any combination of one or more computer-readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter case, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
Moreover, while the operations of the method of the invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Rather, the steps depicted in the flowcharts may change the order of execution. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
Use of the verb "comprise" and its conjugations in this application does not exclude the presence of elements or steps other than those stated in this application. The article "a" or "an" preceding an element does not exclude the presence of a plurality of such elements.
While the spirit and principles of the invention have been described with reference to several particular embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, nor is the division of aspects, which is for convenience only as the features in such aspects may not be combined to benefit. The invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims. The scope of the following claims is to be accorded the broadest interpretation so as to encompass all such modifications and equivalent structures and functions.

Claims (12)

1. A method for recognizing a user utterance, comprising:
acquiring vocabulary information in a user speech to be recognized;
determining pinyin information corresponding to the vocabulary information based on the vocabulary information;
fusing the vocabulary information and the pinyin information to obtain comprehensive characteristic information;
and inputting the comprehensive characteristic information into a user speech model obtained by training, and identifying the user speech.
2. The method of claim 1, wherein the process of training the user speech model comprises:
acquiring sample vocabulary information in a sample language to be input, and replacing part of vocabularies in the sample vocabulary information;
acquiring sample pinyin information corresponding to the replaced sample vocabulary information;
and fusing the replaced sample vocabulary information and the sample pinyin information, and inputting the fused sample vocabulary information and the sample pinyin information into an initial user speech model to train the user speech model.
3. The method of claim 1, wherein fusing the vocabulary information and the pinyin information specifically comprises:
converting the vocabulary information into a first word vector, and converting the pinyin information into a second word vector;
and splicing the first word vector and the second word vector together to realize the fusion of the vocabulary information and the pinyin information.
4. The method of claim 1, wherein prior to determining pinyin information corresponding to the vocabulary information based on the vocabulary information, the method further comprises:
determining whether the vocabulary information contains detachable vocabularies or not based on a preset vocabulary table;
responding to the fact that the vocabulary information contains the detachable vocabulary, splitting the detachable vocabulary into a plurality of vocabularies, and re-acquiring the vocabulary information to be processed in the user speech based on the split vocabularies.
5. A method for identifying an at-risk user, comprising:
recognizing the speech of the current user in a first preset time by the method of any one of claims 1 to 4, and determining a first number of abnormal speech of the current user in the first preset time based on the recognition result;
acquiring a second number of users the current user privately chats with within the first preset time and a third number of risk users the current user privately chats with within the first preset time;
and inputting the first quantity, the second quantity and the third quantity into a risk user identification model obtained by training so as to identify whether the current user is a risk user.
6. The method of claim 5, wherein the inputting the first number, the second number, and the third number into a risk user identification model obtained by training to identify whether the current user is a risk user comprises:
obtaining the punished times of the current user within a second preset time;
inputting the penalized number and the first, second and third numbers into the risky user identification model to identify whether the current user is a risky user.
7. The method of claim 5, wherein the inputting the first number, the second number, and the third number into a risk user identification model obtained by training to identify whether the current user is a risk user comprises:
in response to determining that the first number is 0, determining the current user as a non-risky user;
in response to determining that the first number is greater than or equal to 1, inputting the first number, the second number and the third number into a risk user identification model obtained by training to identify whether the current user is a risk user.
8. The method according to claim 5, wherein recognizing the speech of the current user within the first preset time by the method of any one of claims 1 to 4, and determining the first number of abnormal utterances of the current user within the first preset time based on the recognition result, comprises:
for each utterance of the current user within the first preset time, recognizing the utterance by the method of any one of claims 1 to 4, determining whether the utterance is abnormal speech based on the recognition result, determining whether the utterance comprises preset keywords in response to determining that the utterance is not abnormal speech, and determining that the utterance is abnormal speech in response to determining that the utterance comprises the preset keywords;
counting the number of all abnormal utterances within the first preset time to determine the first number.
9. An apparatus for recognizing a user utterance, comprising:
the vocabulary acquisition module is used for acquiring vocabulary information in the user speech to be recognized;
the pinyin determining module is used for determining pinyin information corresponding to the vocabulary information based on the vocabulary information;
the fusion module is used for fusing the vocabulary information and the pinyin information to obtain comprehensive characteristic information;
and the speech recognition module is used for inputting the comprehensive characteristic information into a trained user speech model and recognizing the user speech.
10. An apparatus for identifying an at-risk user, comprising:
the determining module is used for recognizing the speech of the current user in a first preset time through the method of any one of claims 1 to 4 and determining a first number of abnormal speech of the current user in the first preset time based on the recognition result;
the quantity acquisition module is used for acquiring a second number of users the current user privately chats with within the first preset time and a third number of risk users the current user privately chats with within the first preset time;
and the user identification module is used for inputting the first quantity, the second quantity and the third quantity into a risk user identification model obtained by training so as to identify whether the current user is a risk user.
11. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any one of claims 1 to 4 or claims 5 to 8 when executing the program.
12. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 4 or 5 to 8.
CN202210209634.2A 2022-03-04 2022-03-04 User speech and risk user identification method and related device Pending CN114611504A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210209634.2A CN114611504A (en) 2022-03-04 2022-03-04 User speech and risk user identification method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210209634.2A CN114611504A (en) 2022-03-04 2022-03-04 User speech and risk user identification method and related device

Publications (1)

Publication Number Publication Date
CN114611504A true CN114611504A (en) 2022-06-10

Family

ID=81861200

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210209634.2A Pending CN114611504A (en) 2022-03-04 2022-03-04 User speech and risk user identification method and related device

Country Status (1)

Country Link
CN (1) CN114611504A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117056522A (en) * 2023-10-11 2023-11-14 青岛网信信息科技有限公司 Internet language optimizing processing method, medium and system
CN117056522B (en) * 2023-10-11 2024-03-15 青岛网信信息科技有限公司 Internet language optimizing processing method, medium and system

Similar Documents

Publication Publication Date Title
US10832002B2 (en) System and method for scoring performance of chatbots
CN108962282B (en) Voice detection analysis method and device, computer equipment and storage medium
US20190333118A1 (en) Cognitive product and service rating generation via passive collection of user feedback
CN108847241B (en) Method for recognizing conference voice as text, electronic device and storage medium
JP5150747B2 (en) Method and system for grammatical fitness evaluation as predictive value of speech recognition error
US10019984B2 (en) Speech recognition error diagnosis
CN111460111A (en) Evaluating retraining recommendations for automatic conversation services
WO2021103712A1 (en) Neural network-based voice keyword detection method and device, and system
CN107808674B (en) Method, medium and device for evaluating voice and electronic equipment
CN102915733A (en) Interactive speech recognition
CN112562723B (en) Pronunciation accuracy determination method and device, storage medium and electronic equipment
CN112966082A (en) Audio quality inspection method, device, equipment and storage medium
CN112671985A (en) Agent quality inspection method, device, equipment and storage medium based on deep learning
CN110289015A (en) A kind of audio-frequency processing method, device, server, storage medium and system
CN113342955A (en) Question and answer sentence processing method and device and electronic equipment
CN110647613A (en) Courseware construction method, courseware construction device, courseware construction server and storage medium
CN109947651B (en) Artificial intelligence engine optimization method and device
CN114611504A (en) User speech and risk user identification method and related device
CN113033329A (en) Method and device for judging abnormal answer of question in online education
CN109273004B (en) Predictive speech recognition method and device based on big data
CN111144902A (en) Questionnaire data processing method and device, storage medium and electronic equipment
EP3944230A1 (en) Training voice query models
CN110083807B (en) Contract modification influence automatic prediction method, device, medium and electronic equipment
CN114595318A (en) Customer service reply quality evaluation method and system
CN112530456B (en) Language category identification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination