CN112087726B - Method and system for identifying polyphonic ringtone, electronic equipment and storage medium

Info

Publication number: CN112087726B
Application number: CN202010953701.2A
Authority: CN
Inventors: 邓艳江; 罗超; 胡泓; 李巍
Original assignee: Ctrip Travel Network Technology Shanghai Co Ltd
Current assignee: Ctrip Travel Network Technology Shanghai Co Ltd
Priority date: 2020-09-11
Filing date: 2020-09-11
Publication date: 2022-08-23
Anticipated expiration: 2040-09-11
Also published as: CN112087726A

Abstract

The invention discloses a method and a system for identifying a color ring back tone, an electronic device, and a storage medium. The method comprises the following steps: converting an input audio signal into text; judging whether a keyword matching the text exists; if yes, identifying the audio signal as a color ring back tone; if not, inputting the audio signal into a color ring back tone classification model, and determining whether the audio signal is a color ring back tone according to a prediction result of the model. The color ring back tone classification model is obtained by training on training samples, and the training samples comprise color ring back tone samples and non-color ring back tone samples that contain human voice. The invention first identifies a color ring back tone by keyword matching on the text converted from the audio signal; if the matching fails, the audio signal is input into the color ring back tone classification model for a second identification. Both the text and the audio are thus used to identify the color ring back tone, which improves the accuracy of color ring back tone identification.

Description

Method and system for identifying polyphonic ringtone, electronic equipment and storage medium

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a method and a system for identifying a color ring back tone, an electronic device, and a storage medium.

Background

With the development of artificial intelligence technology, many repetitive tasks are now completed by machines; the customer service robot is one example. "Color ring back tone" is short for personalized multi-color ring back tone service: a service in which the called party sets a special sound effect (music, a song, a story plot, or a character dialogue) that callers hear when they call the called party's mobile phone.

When a customer service robot is in use, its speech recognition function may mistakenly recognize the text content contained in a color ring back tone, which affects downstream intention recognition and session management and in turn causes the whole conversation flow to go wrong. It is therefore necessary to identify color ring back tones and keep them out of downstream intention recognition and session management.

Current color ring back tones can be roughly classified into three categories: first, pure background music; second, pure voice broadcast, for example "welcome you to a hotel"; and third, voice broadcast accompanied by background music. For the first two categories, the traditional text matching method can be used for identification. For the third category, the background music prevents the customer service robot from recognizing a complete sentence, so the text matching method cannot identify it accurately, which reduces the accuracy of color ring back tone identification.

Disclosure of Invention

The invention provides a method and a system for identifying a color ring back tone, an electronic device, and a storage medium, aiming to overcome the defect in the prior art that a voice-broadcast color ring back tone accompanied by background music cannot be accurately identified.

The invention solves the technical problems through the following technical scheme:

the first aspect of the present invention provides a method for identifying a color ring, comprising the following steps:

converting an input audio signal into text;

judging whether keywords matched with the text exist or not;

if yes, identifying the audio signal as a color ring;

if not, inputting the audio signal into a color ring back tone classification model, and determining whether the audio signal is a color ring back tone according to a prediction result of the color ring back tone classification model; the color ring back tone classification model is obtained based on training of training samples, and the training samples comprise color ring back tone samples and non-color ring back tone samples comprising human voice.

Preferably, the inputting the audio signal into a color ring classification model and determining whether the audio signal is a color ring according to a prediction result of the color ring classification model specifically include:

performing framing processing on the audio signal to obtain a plurality of frames of sub-audio signals;

respectively detecting sub audio signals of each frame to obtain an effective frame, wherein the effective frame is a sub audio signal comprising a speech area;

inputting the effective frame into a color ring classification model to obtain a prediction result of the effective frame;

and determining whether the audio signal is a color ring according to the prediction results of all the effective frames.

Preferably, the determining whether the audio signal is a color ring according to the prediction results of all valid frames specifically includes:

if the ratio of the number of effective frames predicted to be a color ring back tone to the number of all effective frames is greater than a preset value, determining that the audio signal is a color ring back tone; otherwise, determining that the audio signal is not a color ring back tone.

Preferably, the inputting the valid frame to the color ring classification model specifically includes:

carrying out windowing and Fourier transform preprocessing on the effective frame to obtain the frequency spectrum characteristics of the effective frame;

and inputting the frequency spectrum characteristics of the effective frame into a color ring classification model.

Preferably, the color ring classification model is a neural network model.

The second aspect of the invention provides a system for identifying a color ring back tone, which comprises a conversion module, a judgment module and a classification module;

the conversion module is used for converting the input audio signal into text;

the judging module is used for judging whether keywords matched with the text exist or not, identifying the audio signal as a color ring under the condition of yes, and calling the classifying module under the condition of no;

the classification module is used for inputting the audio signal into a color ring back tone classification model and determining whether the audio signal is a color ring back tone according to the prediction result of the color ring back tone classification model; the color ring back tone classification model is obtained based on training of training samples, and the training samples comprise color ring back tone samples and non-color ring back tone samples comprising human voice.

Preferably, the classification module specifically includes:

the framing unit is used for framing the audio signal to obtain a plurality of frame sub-audio signals;

the detection unit is used for respectively detecting the sub-audio signals of each frame to obtain an effective frame, wherein the effective frame is the sub-audio signal comprising a voice area;

the input unit is used for inputting the effective frame to a color ring back tone classification model to obtain a prediction result of the effective frame;

and the determining unit is used for determining whether the audio signal is a color ring according to the prediction results of all the effective frames.

Preferably, the determining unit is specifically configured to determine that the audio signal is a color ring back tone when the ratio of the number of valid frames predicted to be a color ring back tone to the number of all valid frames is greater than a preset value, and otherwise to determine that the audio signal is not a color ring back tone.

A third aspect of the present invention provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the method for identifying a color ring according to the first aspect when executing the computer program.

A fourth aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program, which, when executed by a processor, implements the method for color ring back tone identification of the first aspect.

The positive effects of the invention are as follows: a color ring back tone is first identified by keyword matching on the text converted from the audio signal; if the matching fails, the audio signal is input into the color ring back tone classification model for a second identification. Both the text and the audio are thus used, so the accuracy of color ring back tone identification is improved.

Drawings

Fig. 1 is a flowchart of a method for identifying a color ring according to embodiment 1 of the present invention.

Fig. 2 is a flowchart of step S104 according to embodiment 1 of the present invention.

Fig. 3 is a block diagram of a system for identifying a color ring back tone according to embodiment 2 of the present invention.

Fig. 4 is a schematic structural diagram of an electronic device provided in embodiment 3 of the present invention.

Detailed Description

The invention is further illustrated by the following examples, which are not intended to limit the scope of the invention.

Example 1

The present embodiment provides a method for identifying a color ring, as shown in fig. 1, including the following steps:

Step S101, converting the input audio signal into text.

In an example of applying the color ring back tone identification method provided in this embodiment to a customer service robot, the customer service robot has a voice recognition function and is configured to convert an input audio signal into a text.

Step S102, judging whether a keyword matching the text exists; if so, executing step S103; otherwise, executing step S104.

The keywords in step S102 form a set collected from the texts of historical color ring back tones, and may include Chinese, English, phone numbers, and the like. In a specific example, the keywords include "about your call", "for your service", "Thanks for calling", "400-" and so on.

In an example of the specific implementation, the keywords are stored locally in advance, and the keywords are matched with the text obtained in step S101 in the actual use process. In another example of the specific implementation, the keywords are pre-stored in the server, the text obtained in step S101 is transmitted to the server in the actual use process, keyword matching is performed at the server, and the matching result is returned to the local.

In an optional implementation of step S102, keyword refinement is performed on the text obtained in step S101 to obtain target keywords, and the target keywords are matched against the pre-stored keywords to find the pre-stored keyword with the highest matching degree. If that matching degree is greater than a preset value, for example 90%, it is determined that a keyword matching the text obtained in step S101 exists; otherwise, it is determined that no such keyword exists.
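The matching-degree variant above can be sketched as follows. This is only an illustration under assumptions: the patent does not say how the matching degree is computed, so `difflib.SequenceMatcher` is swapped in as a stand-in similarity measure, the keyword list is hypothetical, and the keyword refinement step is assumed to have already produced `target`.

```python
from difflib import SequenceMatcher
from typing import Sequence, Tuple

# Hypothetical pre-stored keyword set (not taken from the patent's real data).
STORED_KEYWORDS = ["thanks for calling", "for your service", "400-"]


def best_match(target: str, keywords: Sequence[str]) -> Tuple[str, float]:
    """Return the stored keyword most similar to `target` and its similarity in [0, 1]."""
    scored = [
        (kw, SequenceMatcher(None, target.lower(), kw.lower()).ratio())
        for kw in keywords
    ]
    return max(scored, key=lambda pair: pair[1])


def has_matching_keyword(
    target: str, keywords: Sequence[str], preset: float = 0.9
) -> bool:
    """A keyword matches only if the best matching degree exceeds the preset value."""
    _, degree = best_match(target, keywords)
    return degree > preset
```

With a 90% preset, a target that differs from a stored keyword by one dropped character (for example a slight recognition error) would still count as a match, while unrelated user speech would not.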

In another possible implementation of step S102, the pre-stored keywords are checked in turn to judge whether the text obtained in step S101 contains any of them; if yes, it is determined that a keyword matching the text exists; otherwise, it is determined that no keyword matching the text exists. In one example, the pre-stored keywords include "call you", "serve you at home", and "Thanks for calling", and the text obtained in step S101 is "good you, hotel serves you at home". Checking the text against each keyword in turn finds that it contains the keyword "serve you at home", that is, a keyword matching the text exists.
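The containment check described above might look like the following minimal sketch. The function name and the case-insensitive comparison are assumptions made for illustration; the patent only states that the pre-stored keywords are checked in turn.

```python
from typing import Iterable, Optional


def find_keyword(text: str, keywords: Iterable[str]) -> Optional[str]:
    """Return the first pre-stored keyword contained in `text`, or None if there is none."""
    lowered = text.lower()  # case-insensitive comparison is an assumption
    for keyword in keywords:
        if keyword.lower() in lowered:
            return keyword
    return None
```

A non-`None` result corresponds to the "yes" branch of step S102 (go to step S103); `None` corresponds to the "no" branch (go to step S104).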

Step S103, identifying the audio signal as a color ring back tone.

Step S104, inputting the audio signal into a color ring back tone classification model, and determining whether the audio signal is a color ring back tone according to the prediction result of the color ring back tone classification model. The color ring back tone classification model is obtained based on training of training samples, and the training samples comprise color ring back tone samples and non-color ring back tone samples comprising human voice.

It should be noted that the non-color ring back tone samples comprising human voice are audio samples of users speaking normally. In an example implementation, the audio of a user speaking normally includes "please help me book a room at hotel X on October 1", "where can I see my order", "hey, hello", and the like.

In the specific implementation of step S104, the prediction result of the color ring classification model is a value between 0 and 1. In one example, if the prediction result is less than 0.5, it is determined that the audio signal is not a color ring, and if the prediction result is greater than 0.5, it is determined that the audio signal is a color ring. In another example, if the prediction result is less than 0.8, it is determined that the audio signal is not a color ring, and if the prediction result is greater than 0.8, it is determined that the audio signal is a color ring.
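The thresholding described above can be written as a one-line decision. This is a sketch: the function name is hypothetical, and because the text does not say what happens when the prediction equals the threshold exactly, that case is treated here as "not a color ring back tone" by assumption.

```python
def is_color_ring_back_tone(prediction: float, threshold: float = 0.5) -> bool:
    """Map a model prediction in [0, 1] to a yes/no decision."""
    if not 0.0 <= prediction <= 1.0:
        raise ValueError("prediction must lie between 0 and 1")
    # Assumption: a prediction exactly equal to the threshold counts as "not".
    return prediction > threshold
```

Raising the threshold from 0.5 to 0.8, as in the second example, makes the classifier more conservative: a prediction of 0.7 is accepted under the first setting and rejected under the second.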

In an example of the specific implementation, the polyphonic ringtone classification model is trained by using the following steps:

acquiring training samples comprising color ring back tone samples and non-color ring back tone samples that contain human voice; each color ring back tone sample comprises the audio signal of a color ring back tone and a corresponding classification label, and each non-color ring back tone sample comprises the audio signal of a user speaking normally and a corresponding classification label;

and inputting the training samples into the constructed color ring back tone classification model, and adjusting the parameters of the model according to the output prediction results and the classification labels of the training samples until the model converges.
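The "predict, compare with labels, adjust parameters until convergence" loop can be illustrated with a deliberately small stand-in. The sketch below trains a logistic-regression classifier by gradient descent rather than the neural network the patent has in mind; the feature vectors, learning rate, tolerance, and function names are all assumptions chosen for illustration.

```python
import math
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (feature vector, label: 1 = ring back tone, 0 = not)


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train(samples: Sequence[Sample], lr: float = 0.5,
          tol: float = 1e-6, max_epochs: int = 5000) -> Tuple[List[float], float, float]:
    """Adjust weights and bias until the mean cross-entropy loss stops changing."""
    dim = len(samples[0][0])
    n = len(samples)
    w, b = [0.0] * dim, 0.0
    prev_loss = float("inf")
    loss = prev_loss
    for _ in range(max_epochs):
        grad_w, grad_b, loss = [0.0] * dim, 0.0, 0.0
        for x, y in samples:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            p = min(max(p, 1e-12), 1.0 - 1e-12)  # avoid log(0)
            loss -= (y * math.log(p) + (1 - y) * math.log(1.0 - p)) / n
            err = p - y  # prediction minus classification label
            for i, xi in enumerate(x):
                grad_w[i] += err * xi / n
            grad_b += err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
        if abs(prev_loss - loss) < tol:  # treated as "converged"
            break
        prev_loss = loss
    return w, b, loss
```

In a real system the feature vectors would come from the audio (for example spectral features of valid frames), and the model and optimizer would be those of the chosen neural network framework.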

In an optional implementation manner, the color ring back tone classification model is a neural network model. In a specific example, the color ring classification model includes a three-layer neural network, and a Softmax function (normalized exponential function) is used as the activation function.
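A forward pass through such a network could be sketched as below. The patent gives only "three layers" and "Softmax", so several details here are assumptions: Softmax is applied on the output layer only, the hidden layers use ReLU, the layer sizes are arbitrary, and the weights are random rather than trained.

```python
import math
import random
from typing import List, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weight rows, biases)


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Create a dense layer with small random weights (illustrative, untrained)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x: List[float], layer: Layer) -> List[float]:
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(v: List[float]) -> List[float]:
    return [max(0.0, a) for a in v]


def softmax(v: List[float]) -> List[float]:
    """Normalized exponential; subtracting the max keeps exp() from overflowing."""
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x: List[float], layers: List[Layer]) -> List[float]:
    """Return [P(not ring back tone), P(ring back tone)] for one feature vector."""
    h = x
    for layer in layers[:-1]:
        h = relu(dense(h, layer))  # ReLU on hidden layers is an assumption
    return softmax(dense(h, layers[-1]))
```

The second output probability plays the role of the "value between 0 and 1" that is compared against the decision threshold.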

In an alternative embodiment, step S104 includes:

Step S104a, performing framing processing on the audio signal to obtain a plurality of frames of sub-audio signals. In a specific example, an input audio signal with a length of 1 s is divided into sub-audio signals with a fixed length of 10 ms, i.e., each frame of the sub-audio signals has a length of 10 ms.
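The framing in step S104a can be sketched as follows. The 16 kHz sample rate used in the usage note and the decision to drop a trailing partial frame are assumptions; the patent specifies only the 10 ms frame length.

```python
from typing import List, Sequence


def split_into_frames(samples: Sequence[float], sample_rate: int,
                      frame_ms: int = 10) -> List[List[float]]:
    """Split an audio signal into consecutive, non-overlapping fixed-length frames."""
    frame_len = sample_rate * frame_ms // 1000  # samples per frame
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    n_frames = len(samples) // frame_len  # assumption: drop a trailing partial frame
    return [list(samples[i * frame_len:(i + 1) * frame_len]) for i in range(n_frames)]
```

For a 1 s signal sampled at 16 kHz, this yields 100 frames of 160 samples each.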

Step S104b, detecting the sub audio signals of each frame respectively to obtain an effective frame, where the effective frame is a sub audio signal including a speech region.

In step S104b, each frame of sub-audio signal is checked for a speech area or a mute area. A sub-audio signal that includes a speech area is a valid frame; a sub-audio signal that includes no speech area, i.e., consists entirely of mute area, is an invalid frame.

In an alternative embodiment of step S104b, VAD (Voice Activity Detection) is used to detect each frame of sub-audio signals.
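The patent does not say which VAD algorithm is used. As a rough illustration only, the sketch below swaps in the simplest possible detector, a fixed root-mean-square energy threshold; the threshold value and function names are assumptions, and a production system would more likely use a dedicated VAD component.

```python
import math
from typing import List, Sequence


def frame_rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one frame (0.0 for an empty frame)."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def select_valid_frames(frames: Sequence[Sequence[float]],
                        rms_threshold: float = 0.01) -> List[Sequence[float]]:
    """Keep frames whose energy exceeds the threshold; the rest are treated as mute."""
    return [frame for frame in frames if frame_rms(frame) > rms_threshold]
```

Note that an energy threshold alone would also pass loud background music; it only separates sound from silence, which is why the classification model is still needed for the valid frames.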

Step S104c, inputting the effective frame to a color ring classification model to obtain the prediction result of the effective frame.

In an optional implementation of step S104c, the effective frame obtained in step S104b is preprocessed by windowing and Fourier transform to obtain the spectral features of the effective frame, and these spectral features are input to the color ring back tone classification model.
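The windowing and Fourier transform preprocessing might be sketched as below. The choice of a Hamming window and of the magnitude spectrum as the feature are assumptions, since the patent names neither. A naive discrete Fourier transform is written out to keep the example dependency-free; in practice an FFT routine from a numerical library would be used instead.

```python
import cmath
import math
from typing import List, Sequence


def hamming(n: int) -> List[float]:
    """Symmetric Hamming window of length n."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def magnitude_spectrum(frame: Sequence[float]) -> List[float]:
    """Window the frame, then return |X[k]| for the non-negative frequency bins."""
    n = len(frame)
    windowed = [s * w for s, w in zip(frame, hamming(n))]
    spectrum = []
    for k in range(n // 2 + 1):  # real input: bins above n/2 are mirror images
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(windowed))
        spectrum.append(abs(acc))
    return spectrum
```

For a 160-sample frame at an assumed 16 kHz sample rate, each bin spans 100 Hz, so a pure 1000 Hz tone peaks at bin 10.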

Step S104d, determining whether the audio signal is a color ring according to the prediction results of all valid frames.

The prediction result of each valid frame may be a color ring or a human voice, so in step S104d, it is necessary to determine whether the input audio signal is a color ring according to the prediction results of all valid frames.

In an optional implementation of step S104d, if the ratio of the number of valid frames predicted to be a color ring back tone to the number of all valid frames is greater than a preset value, it is determined that the audio signal is a color ring back tone; otherwise, it is determined that the audio signal is not a color ring back tone. The preset value may be set according to the actual situation. For example, with the preset value set to 80%, the input audio signal is determined to be a color ring back tone rather than a human voice when more than 80% of all valid frames are predicted to be a color ring back tone.
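The ratio rule of step S104d reduces to a few lines. In this sketch the function name is hypothetical, and returning `False` when there are no valid frames at all is an assumption, since the patent does not cover that case.

```python
from typing import Sequence


def decide_by_frame_ratio(frame_is_ring_back_tone: Sequence[bool],
                          preset: float = 0.8) -> bool:
    """Aggregate per-frame predictions for the valid frames into one decision."""
    total = len(frame_is_ring_back_tone)
    if total == 0:
        return False  # assumption: no valid frames means no positive decision
    ratio = sum(frame_is_ring_back_tone) / total
    return ratio > preset
```

Because the comparison is strict, exactly 80% positive frames does not meet an 80% preset value, while 90% does.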

Fig. 2 is a flowchart illustrating step S104. As shown in Fig. 2, the input audio signal is framed to obtain a plurality of frames of sub-audio signals, and each frame is traversed and processed as follows: VAD detection is performed on the sub-audio signal, and the detection result determines whether it is a valid frame; if it is a valid frame, it is input into the color ring back tone classification model, and the model's prediction result determines whether the valid frame is a color ring back tone; if so, the count of valid frames that are color ring back tones is incremented. After all sub-audio signals have been traversed, whether the input audio signal is a color ring back tone is identified from the number of valid frames that are color ring back tones and the number of all valid frames: if the ratio of the former to the latter is greater than a preset value, the input audio signal is identified as a color ring back tone.

In this embodiment, a color ring back tone is first identified by keyword matching on the text converted from the audio signal; if the matching fails, the audio signal is input into the color ring back tone classification model for a second identification. Both the text and the audio are thus used, so the accuracy of color ring back tone identification is improved.
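The two-stage flow of steps S101 to S104 can be summarized in a short sketch. The speech recognizer and the audio classifier are passed in as callables because the patent does not tie the method to particular components; the function and parameter names are hypothetical, and simple substring containment stands in for the keyword matching.

```python
from typing import Callable, Iterable


def identify_color_ring_back_tone(
    audio: bytes,
    transcribe: Callable[[bytes], str],      # S101: speech recognition (assumed interface)
    keywords: Iterable[str],                 # S102: pre-stored keyword set
    classify_audio: Callable[[bytes], bool]  # S104: audio classifier (assumed interface)
) -> bool:
    """Return True if the audio is identified as a color ring back tone."""
    text = transcribe(audio).lower()
    if any(keyword.lower() in text for keyword in keywords):
        return True  # S103: keyword matched, the classifier is not consulted
    return classify_audio(audio)  # S104: second identification on the audio itself
```

Ordering the stages this way means the audio model only runs when the text evidence is inconclusive, for example when background music has corrupted the transcript.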

The present embodiment is described in detail below, taking as an example an input audio signal that is a voice-broadcast color ring back tone accompanied by background music, with the content "welcome you to call a hotel, and we will provide the best service". First, the input audio signal is converted into text; because of the background music, the converted text is "welcome hotel, and we provide premium services". Keyword matching is performed on this text, and no matching keyword is found. Then the input audio signal is input to the trained color ring back tone classification model, and the audio is determined to be a color ring back tone according to the model's prediction result. That is, the method provided by this embodiment can successfully identify a voice-broadcast color ring back tone accompanied by background music and, compared with the prior art, improves the accuracy of color ring back tone identification.

In an example of applying the color ring back tone identification method provided in this embodiment to a customer service robot, if the customer service robot identifies an input audio signal as a color ring back tone, the audio signal is blocked, that is, it does not enter downstream intention recognition and session management, so that the color ring back tone is prevented from interfering with the normal session.

Example 2

The embodiment provides a system 20 for identifying a color ring back tone, as shown in fig. 3, including a conversion module 21, a determination module 22, and a classification module 23.

The conversion module 21 is used to convert the input audio signal into text.

The judging module 22 is configured to judge whether a keyword matching the text exists, to identify the audio signal as a color ring back tone if such a keyword exists, and to call the classification module 23 if it does not.

The classification module 23 is configured to input the audio signal to a color ring back tone classification model, and to determine whether the audio signal is a color ring back tone according to a prediction result of the model. The color ring back tone classification model is obtained based on training of training samples, and the training samples comprise color ring back tone samples and non-color ring back tone samples comprising human voice.

In an optional implementation manner, the color ring classification model is a neural network model.

In an alternative embodiment, as shown in fig. 3, the classification module 23 specifically includes a framing unit, a detection unit, an input unit, and a determination unit.

The framing unit is used for framing the audio signal to obtain a plurality of frames of sub-audio signals.

The detection unit is used for respectively detecting the sub-audio signals of each frame to obtain an effective frame, wherein the effective frame is the sub-audio signal comprising a speech area.

The input unit is used for inputting the effective frame to a color ring classification model to obtain a prediction result that the effective frame is a color ring or a human voice.

In an optional implementation, the input unit is specifically configured to preprocess the effective frame by windowing and Fourier transform to obtain the spectral features of the effective frame, and to input these spectral features into the color ring back tone classification model.

The determining unit is used for determining whether the audio signal is a color ring according to the prediction results of all the effective frames.

In an optional implementation manner, the determining unit is specifically configured to determine that the audio signal is a color ring tone when a ratio of a number of valid frames that are color ring tones to a number of all valid frames is greater than a preset value, and otherwise determine that the audio signal is not a color ring tone.

Example 3

Fig. 4 is a schematic structural diagram of an electronic device provided in this embodiment. The electronic device comprises a memory, a processor, a computer program stored on the memory and capable of running on the processor, and a plurality of subsystems for realizing different functions, wherein the processor realizes the color ring identification method of embodiment 1 when executing the program. The electronic device 3 shown in fig. 4 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiment of the present invention.

The components of the electronic device 3 may include, but are not limited to: the at least one processor 4, the at least one memory 5, and a bus 6 connecting the various system components (including the memory 5 and the processor 4).

The bus 6 includes a data bus, an address bus, and a control bus.

The memory 5 may include volatile memory, such as Random Access Memory (RAM) and/or cache memory, and may further include Read Only Memory (ROM).

The memory 5 may also include a program/utility having a set (at least one) of program modules including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.

The processor 4 executes various functional applications and data processing, such as the method for identifying a color ring back tone according to embodiment 1 of the present invention, by running the computer program stored in the memory 5.

The electronic device 3 may also communicate with one or more external devices 7, such as a keyboard, pointing device, etc. Such communication may be via an input/output (I/O) interface 8. Also, the electronic device 3 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 9. As shown in fig. 4, the network adapter 9 communicates with other modules of the electronic device 3 via the bus 6. It should be appreciated that although not shown in FIG. 4, other hardware and/or software modules may be used in conjunction with the electronic device 3, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID (disk array) systems, tape drives, and data backup storage systems, etc.

It should be noted that although several units/modules or sub-units/modules of the electronic device are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, according to embodiments of the invention, the features and functions of two or more of the units/modules described above may be embodied in one unit/module. Conversely, the features and functions of one unit/module described above may be further divided and embodied by a plurality of units/modules.

Example 4

The present embodiment provides a computer-readable storage medium on which a computer program is stored, which when executed by a processor implements the method of color ring back tone identification of embodiment 1.

More specific examples of the readable storage medium may include, but are not limited to: a portable disk, hard disk, random access memory, read only memory, erasable programmable read only memory, optical storage device, magnetic storage device, or any suitable combination of the foregoing.

In a possible implementation, the present invention can also be implemented in the form of a program product that includes program code; when the program product runs on a terminal device, the program code causes the terminal device to execute the color ring back tone identification method of embodiment 1.

The program code for carrying out the invention may be written in any combination of one or more programming languages, and may be executed entirely on the user device, partly on the user device, as a stand-alone software package, partly on the user device and partly on a remote device, or entirely on the remote device.

While specific embodiments of the invention have been described above, it will be appreciated by those skilled in the art that this is by way of example only, and that the scope of the invention is defined by the appended claims. Various changes and modifications to these embodiments may be made by those skilled in the art without departing from the spirit and scope of the invention, and these changes and modifications are within the scope of the invention.

Claims

1. A method for identifying a color ring back tone is characterized by comprising the following steps:

converting an input audio signal into text;

judging whether keywords matched with the text exist or not;

if yes, identifying the audio signal as a color ring;

if not, inputting the audio signal into a color ring back tone classification model, and determining whether the audio signal is a color ring back tone according to a prediction result of the color ring back tone classification model; the color ring back tone classification model is obtained based on training of training samples, wherein the training samples comprise color ring back tone samples and non-color ring back tone samples comprising human voice; the audio signal determined to be a color ring back tone by the color ring back tone classification model is a voice-broadcast color ring back tone accompanied by background music;

the inputting the audio signal into a color ring classification model, and determining whether the audio signal is a color ring according to a prediction result of the color ring classification model specifically include:

respectively detecting sub audio signals of each frame to obtain an effective frame, wherein the effective frame is a sub audio signal comprising a voice area;

2. The method of claim 1, wherein the determining whether the audio signal is a color ring according to the prediction results of all valid frames specifically comprises:

if the ratio of the number of effective frames predicted to be a color ring back tone to the number of all effective frames is greater than a preset value, determining that the audio signal is a color ring back tone, and otherwise, determining that the audio signal is not a color ring back tone.

3. The method of claim 1, wherein the inputting the valid frame to a color ring classification model specifically comprises:

4. The method of any of claims 1-3, wherein the color ring back tone classification model is a neural network model.

5. A system for identifying color ring back tone is characterized by comprising a conversion module, a judgment module and a classification module;

the conversion module is used for converting the input audio signal into text;

the judging module is used for judging whether keywords matched with the text exist or not, identifying the audio signal as a color ring if the keywords exist, and calling the classifying module if the keywords do not exist;

the classification module is used for inputting the audio signal into a color ring back tone classification model and determining whether the audio signal is a color ring back tone according to the prediction result of the color ring back tone classification model; the color ring back tone classification model is obtained based on training of training samples, wherein the training samples comprise color ring back tone samples and non-color ring back tone samples comprising human voice; the audio signal determined to be a color ring back tone by the color ring back tone classification model is a voice-broadcast color ring back tone;

the classification module specifically comprises:

the framing unit is used for framing the audio signals to obtain a plurality of frames of sub-audio signals;

the input unit is used for inputting the effective frame into a color ring classification model to obtain a prediction result of the effective frame;

and the determining unit is used for determining whether the audio signal is the color ring according to the prediction results of all the effective frames.

6. The system of claim 5, wherein the determining unit is specifically configured to determine that the audio signal is a color ring tone when a ratio of a number of valid frames that are color ring tones to a number of all valid frames is greater than a preset value, and otherwise determine that the audio signal is not a color ring tone.

7. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of color ring back tone identification of any one of claims 1-4 when executing the computer program.

8. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of ring back tone identification according to any one of claims 1-4.