CN112114886B - Acquisition method and device for false wake-up audio

Acquisition method and device for false wake-up audio

Info

Publication number
CN112114886B
Authority
CN
China
Prior art keywords
audio
wake
fragment
processed
false wake
Prior art date
Legal status
Active
Application number
CN202010981082.8A
Other languages
Chinese (zh)
Other versions
CN112114886A (en)
Inventor
李旭
杜霜霜
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010981082.8A
Publication of CN112114886A
Application granted
Publication of CN112114886B


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44 - Arrangements for executing specific programs
    • G06F 9/4401 - Bootstrapping
    • G06F 9/4418 - Suspend and resume; Hibernate and awake
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/04 - Segmentation; Word boundary detection
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 - Speaker identification or verification
    • G10L 17/18 - Artificial neural networks; Connectionist approaches

Abstract

The application discloses a method and a device for acquiring false wake-up audio, and relates to the technical field of speech recognition. A specific embodiment comprises the following steps: collecting played audio as audio to be processed; inputting the audio to be processed into a preset deep neural network model to obtain the position, within the audio to be processed, of an approximate wake-up audio segment, wherein the deep neural network model is used for predicting the positions of audio segments that approximate the wake-up word audio in the input audio; and determining a false wake-up segment in the audio to be processed based on the position of the approximate wake-up audio segment, and generating a false wake-up segment set including the false wake-up segment. With the method and the device, the position of a false wake-up segment in the audio can be accurately located by the deep neural network model, so that an accurate false wake-up segment set is obtained.

Description

Acquisition method and device for false wake-up audio
Technical Field
The application relates to the field of computer technology, in particular to the field of speech recognition, and more particularly to a method and a device for acquiring false wake-up audio.
Background
False wake-up in intelligent voice interaction products is a phenomenon whose triggering probability is low, but which, once triggered, provokes strong objections from users.
False wake-ups arise easily while a television, music and the like are playing. The corpora that cause them come from diverse sources and are difficult to capture in a test environment; even when they are captured, the original scene is hard to reproduce completely so that the intelligent product produces the false wake-up again. For new devices in particular, collecting false wake-up audio is even more difficult.
Disclosure of Invention
Provided are a method, a device, an electronic device and a storage medium for acquiring false wake-up audio.
According to a first aspect, there is provided a method for acquiring false wake-up audio, including: collecting played audio as audio to be processed; inputting the audio to be processed into a preset deep neural network model to obtain the position, within the audio to be processed, of an approximate wake-up audio segment, wherein the deep neural network model is used for predicting the positions of audio segments that approximate the wake-up word audio in the input audio; and determining a false wake-up segment in the audio to be processed based on the position of the approximate wake-up audio segment, and generating a false wake-up segment set including the false wake-up segment.
According to a second aspect, there is provided a device for acquiring false wake-up audio, including: an acquisition unit configured to collect played audio as audio to be processed; a prediction unit configured to input the audio to be processed into a preset deep neural network model to obtain the position, within the audio to be processed, of an approximate wake-up audio segment, wherein the deep neural network model is used for predicting the positions of audio segments that approximate the wake-up word audio in the input audio; and a generating unit configured to determine a false wake-up segment in the audio to be processed based on the position of the approximate wake-up audio segment, and to generate a false wake-up segment set including the false wake-up segment.
According to a third aspect, there is provided an electronic device comprising: one or more processors; and a storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method according to any embodiment of the method for acquiring false wake-up audio.
According to a fourth aspect, a computer readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the method according to any embodiment of the method for acquiring false wake-up audio.
According to the scheme of the application, the position of a false wake-up segment in the audio can be accurately located by the deep neural network model, so that an accurate false wake-up segment set is obtained.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading the following detailed description of non-limiting embodiments, made with reference to the accompanying drawings, in which:
FIG. 1 is an exemplary system architecture diagram in which some embodiments of the present application may be applied;
FIG. 2 is a flow chart of one embodiment of a method of acquisition of false wake-up audio according to the present application;
FIG. 3 is a schematic diagram of an application scenario of a method of acquisition of false wake-up audio according to the present application;
FIG. 4a is a flow chart of yet another embodiment of a method of acquisition of false wake-up audio according to the present application;
FIG. 4b is a schematic diagram of yet another application scenario of a method of acquisition of false wake-up audio according to the present application;
FIG. 5 is a schematic diagram of one embodiment of a false wake-up audio acquisition device according to the present application;
fig. 6 is a block diagram of an electronic device for implementing a method for acquiring false wake-up audio according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present application to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the method or device for acquiring false wake-up audio of the present application may be applied.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as smart interactive applications, video applications, live applications, instant messaging tools, mailbox clients, social platform software, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices with display screens, including but not limited to smartphones, tablets, e-book readers, laptop computers, desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they may be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (for example, multiple pieces of software or software modules for providing distributed services) or as a single piece of software or software module. No specific limitation is imposed here.
The server 105 may be a server providing various services, for example a background server providing support for the terminal devices 101, 102, 103. The background server can analyze and otherwise process received data such as the audio to be processed, and feed the processing result (for example, a false wake-up segment set) back to the terminal devices.
It should be noted that the method for acquiring false wake-up audio provided in the embodiments of the present application may be executed by the server 105 or by the terminal devices 101, 102, 103, and correspondingly the device for acquiring false wake-up audio may be provided in the server 105 or in the terminal devices 101, 102, 103.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to fig. 2, a flow 200 of one embodiment of a method of acquisition of false wake-up audio according to the present application is shown. The method for acquiring the false wake-up audio comprises the following steps:
step 201, collecting the played audio as the audio to be processed.
In this embodiment, the execution body (for example, the server or terminal device shown in fig. 1) on which the method for acquiring false wake-up audio runs may collect the played audio and use it as the audio to be processed.
In practice, the execution body may record the played audio in order to collect the audio to be processed. Specifically, the execution body or another electronic device may record the audio in WAV format at a sampling rate of 16 kHz, and the execution body can then preserve the audio to be processed losslessly.
Alternatively, the audio to be processed may be audio played in a home scene and/or a conference scene, such as television programmes, audiobooks, music, and the like.
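The description above leaves the recording tooling open; the following is a minimal sketch of collecting 16 kHz WAV audio to be processed, assuming the Python packages `sounddevice` and `soundfile` are available. The package choice, function name, and the ten-minute example duration are illustrative assumptions, not taken from the patent.

```python
import sounddevice as sd
import soundfile as sf

SAMPLE_RATE = 16_000  # 16 kHz, as suggested in the description
CHANNELS = 1

def record_to_be_processed(duration_s: float, out_path: str = "to_be_processed.wav") -> str:
    """Record the audio currently playing in the room via the default input device
    and save it losslessly as 16-bit PCM WAV."""
    frames = int(duration_s * SAMPLE_RATE)
    audio = sd.rec(frames, samplerate=SAMPLE_RATE, channels=CHANNELS, dtype="int16")
    sd.wait()  # block until recording is finished
    sf.write(out_path, audio, SAMPLE_RATE, subtype="PCM_16")
    return out_path

# Example: record ten minutes of a television programme playing in the room
# wav_path = record_to_be_processed(duration_s=600)
```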
Step 202, inputting the audio to be processed into a preset deep neural network model to obtain the position, within the audio to be processed, of an approximate wake-up audio segment, wherein the deep neural network model is used for predicting the positions of audio segments that approximate the wake-up word audio in the input audio.
In this embodiment, the execution body may input the audio to be processed into a preset deep neural network model, so as to obtain the position of the approximate wake-up audio segment in the audio to be processed. The deep neural network model may be any of various deep neural networks, such as a convolutional neural network or a residual neural network.
Wake-up word audio refers to the audio corresponding to the text of the wake-up word. "Approximate" audio here covers identical audio as well as similar audio, that is, audio whose pronunciation is the same as or similar to that of the wake-up word.
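The patent does not fix a network architecture beyond naming convolutional and residual networks as options. The following is a minimal PyTorch sketch, assuming log-mel filterbank features as input; the class name, layer sizes, and the frame-level scoring formulation are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class WakeWordLocator(nn.Module):
    """Predicts, for every feature frame, the probability that the frame belongs to
    an audio segment approximating the wake-word audio."""

    def __init__(self, n_mels: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv1d(64, 1, kernel_size=1)  # one score per frame

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, n_frames) -> (batch, n_frames) frame-level probabilities
        return torch.sigmoid(self.head(self.backbone(mel))).squeeze(1)
```

How such frame-level scores could be turned into segment positions and wake-up confidences is sketched in a later example; training the model is outside the scope of this sketch.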
Step 203, determining a false wake-up segment in the audio to be processed based on the position of the approximate wake-up audio segment, and generating a false wake-up segment set including the false wake-up segment.
In this embodiment, the execution body may determine the false wake-up segment based on the position of the approximate wake-up audio segment. The execution body may then generate a false wake-up segment set including the false wake-up segments, for example, by directly forming the set from the individual false wake-up segments.
In practice, the execution body may determine the false wake-up segment based on the position of the approximate wake-up audio segment in various ways. For example, the execution body may directly cut the approximate wake-up audio segment out of the audio to be processed and use it as the false wake-up segment.
The method provided by the embodiments of the present application can accurately locate the position of a false wake-up segment in the audio through the deep neural network model and thus obtain an accurate false wake-up segment set.
In some alternative implementations of the present embodiment, the position includes a start point and an end point, and determining a false wake-up segment in the audio to be processed based on the position of the approximate wake-up audio segment in step 203 includes: from the start point of the approximate wake-up audio segment, determining the point in the audio to be processed that lies a first preset duration earlier as the target start point; from the end point of the approximate wake-up audio segment, determining the point in the audio to be processed that lies a second preset duration later as the target end point; and extracting the audio segment between the target start point and the target end point and determining the extracted audio segment as the false wake-up segment.
In these alternative implementations, "earlier" means a playback time before the approximate wake-up audio segment and "later" means a playback time after it; for example, in a video 60 minutes long whose approximate wake-up audio segment plays from the 20th to the 25th minute, "earlier" refers to times before the 20th minute and "later" to times after the 25th minute. The execution body thus moves the start point of the approximate wake-up audio segment earlier and its end point later, extending the length of the segment. Optionally, the first preset duration and the second preset duration may be equal. The start point and end point may be expressed as the playback times of the segment's boundaries within the whole audio, or as the indices of the start and end audio frames within the whole audio.
The execution body may extract the extended approximate wake-up audio segment, for example by cutting the audio; the extracted result is the false wake-up segment.
These implementations appropriately extend the audio segment, which helps to reproduce the real wake-up scene more accurately when the false wake-up segment is played back.
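A minimal sketch of this extension-and-extraction step on a 16 kHz waveform held as a NumPy array; the function name and the 0.5-second default paddings are illustrative assumptions, since the patent does not specify the preset durations.

```python
import numpy as np

SAMPLE_RATE = 16_000

def extract_false_wake_segment(
    audio: np.ndarray,
    start_s: float,
    end_s: float,
    first_preset_s: float = 0.5,   # padding before the approximate segment (assumed)
    second_preset_s: float = 0.5,  # padding after the approximate segment (assumed)
) -> np.ndarray:
    """Move the start point earlier and the end point later by the preset durations,
    then cut the corresponding samples out of the audio to be processed."""
    target_start = max(0.0, start_s - first_preset_s)
    target_end = min(len(audio) / SAMPLE_RATE, end_s + second_preset_s)
    return audio[int(target_start * SAMPLE_RATE): int(target_end * SAMPLE_RATE)]
```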
With continued reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of the method for acquiring false wake-up audio according to the present embodiment. In the application scenario of fig. 3, the execution body 301 collects the played audio as the audio to be processed 302. The execution body 301 inputs the audio to be processed 302 into a preset deep neural network model 303 to obtain the position 304, within the audio to be processed, of an approximate wake-up audio segment, where the deep neural network model is used for predicting the positions of audio segments that approximate the wake-up word audio in the input audio. The execution body 301 then determines a false wake-up segment in the audio to be processed based on the position 304 of the approximate wake-up audio segment, and generates a false wake-up segment set 305 including the false wake-up segment.
With further reference to fig. 4a, a flow 400 of yet another embodiment of a method of acquisition of false wake-up audio is shown. The process 400 includes the steps of:
step 401, collecting the played audio as the audio to be processed.
In this embodiment, the execution body (for example, the server or terminal device shown in fig. 1) on which the method for acquiring false wake-up audio runs may collect the played audio and use it as the audio to be processed.
Step 402, inputting the audio to be processed into a preset deep neural network model, and obtaining the wake-up confidence of the approximate wake-up audio segment and its position in the audio to be processed as output by the deep neural network model, wherein the wake-up confidence of the approximate wake-up audio segment with respect to the wake-up word audio is higher than the confidence threshold of the deep neural network model, and the deep neural network model is used for predicting the positions of audio segments that approximate the wake-up word audio in the input audio.
In this embodiment, the deep neural network model may output not only the position of the approximate wake-up audio segment in the audio to be processed but also its wake-up confidence. The wake-up confidence of an approximate wake-up audio segment is the probability, as estimated by the electronic device, that the segment is wake-up word audio and will therefore trigger a wake-up. The deep neural network model outputs a segment as an approximate wake-up audio segment only if the segment's wake-up confidence is above the confidence threshold; that is, every approximate segment output by the model has a wake-up confidence above the confidence threshold.
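As an illustration of how frame-level scores such as those produced by the earlier model sketch could be turned into approximate wake-up audio segments with positions and wake-up confidences, the following hypothetical helper groups consecutive frames whose score exceeds the confidence threshold. The 10 ms frame shift, the grouping rule, and the use of the peak score as the segment's wake-up confidence are all assumptions, not taken from the patent.

```python
from typing import List, Tuple

def approximate_wake_segments(
    frame_probs: List[float],
    frame_shift_s: float = 0.01,        # assumed 10 ms hop between feature frames
    confidence_threshold: float = 0.5,  # the model's confidence threshold
) -> List[Tuple[float, float, float]]:
    """Group consecutive frames above the threshold into candidate segments and
    return (start_s, end_s, wake_confidence) for each approximate wake-up segment
    whose confidence exceeds the threshold."""
    segments, start, peak = [], None, 0.0
    for i, p in enumerate(frame_probs + [0.0]):  # sentinel closes a trailing run
        if p >= confidence_threshold and start is None:
            start, peak = i, p
        elif p >= confidence_threshold:
            peak = max(peak, p)
        elif start is not None:
            segments.append((start * frame_shift_s, i * frame_shift_s, peak))
            start = None
    return segments
```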
Step 403, determining a false wake-up segment in the audio to be processed based on the position of the approximate wake-up audio segment, and generating a false wake-up segment set including the false wake-up segment.
In this embodiment, the execution body may determine the false wake-up segment based on the position of the approximate wake-up audio segment, and may then generate a false wake-up segment set including the false wake-up segments, for example, by directly forming the set from the individual false wake-up segments.
This embodiment can use the wake-up confidence to accurately indicate how likely an approximate wake-up audio segment is to wake the intelligent device, and can control the recall of the deep neural network model through the confidence threshold.
In some optional implementations of the present embodiment, generating the false wake-up segment set including the false wake-up segment in step 403 may include: splicing the individual false wake-up segments to generate a false wake-up segment set including the splicing result. The method may further include: in response to the number of positions of the output approximate wake-up audio segments being less than a target number threshold, lowering the confidence threshold, wherein the target number threshold is associated with the duration of the audio to be processed; and re-inputting the audio to be processed into the deep neural network model with the lowered confidence threshold to obtain the positions of the approximate wake-up audio segments in the audio to be processed and their wake-up confidences, determining false wake-up segments based on those positions, splicing the false wake-up segments, and updating the false wake-up segment set with the splicing result, wherein the duration of the spliced false wake-up segments in the false wake-up segment set is longer than a target duration.
In these alternative implementations, the execution body may re-input the audio to be processed into the deep neural network model with the lowered confidence threshold, obtain the new positions of the approximate wake-up audio segments output by the model, and determine false wake-up segments based on those positions. The execution body may then splice the false wake-up segments and update the false wake-up segment set with the new splicing result.
In practice, the execution body may update the false wake-up segment set in various ways. For example, it may discard the content of the original set and form the set from the new splicing result, or it may add the new splicing result to the original set and use the combined result as the updated false wake-up segment set.
The execution body may determine the target number threshold in various ways. For example, it may look up a preset correspondence table between number thresholds and durations of audio to be processed and take the number threshold corresponding to the duration of the audio input into the deep neural network model as the target number threshold. Alternatively, it may input the duration of the audio to be processed into a preset number-threshold determination model and take the output of that model as the target number threshold.
These implementations lower the confidence threshold when the recall of the deep neural network model is low, thereby improving the model's recall of audio segments.
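A sketch of this recall-control loop, reusing the hypothetical helpers `approximate_wake_segments` and `extract_false_wake_segment` from the earlier sketches. The patent re-inputs the audio into the model after lowering the threshold; here the cached frame scores are simply re-filtered, and the step size, the floor, and the segments-per-hour rule linking the target count to the audio duration are assumptions.

```python
import numpy as np

def build_false_wake_set(
    audio: np.ndarray,
    frame_probs: list,
    duration_s: float,
    confidence_threshold: float = 0.5,
    segments_per_hour: float = 10.0,  # assumed rule linking target count to duration
    step: float = 0.05,
    min_threshold: float = 0.1,
) -> np.ndarray:
    """Lower the confidence threshold until enough approximate wake-up segments are
    recalled, then splice the extracted false wake-up segments into one result."""
    # approximate_wake_segments / extract_false_wake_segment: hypothetical helpers above
    target_number_threshold = max(1, int(segments_per_hour * duration_s / 3600))
    while True:
        candidates = approximate_wake_segments(
            frame_probs, confidence_threshold=confidence_threshold)
        if len(candidates) >= target_number_threshold or confidence_threshold <= min_threshold:
            break
        confidence_threshold -= step  # recall too low: lower the threshold and re-filter
    pieces = [extract_false_wake_segment(audio, start, end) for start, end, _ in candidates]
    return np.concatenate(pieces) if pieces else np.empty(0, dtype=audio.dtype)
```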
In some optional implementations of this embodiment, the method may further include: obtaining the wake-up results produced on the device under test when loudspeakers placed at a plurality of angles play the false wake-up segment set; if, in the wake-up results, the number of false wake-up segments that wake the device under test reaches a specified number threshold, determining that the false wake-up segment set is a valid set, wherein the specified number threshold is associated with the number of false wake-up segments included in the set; and if the number of false wake-up segments that wake the device under test does not reach the specified number threshold, determining that the false wake-up segment set is an invalid set.
In these alternative implementations, the execution body may play the false wake-up segment set through a plurality of loudspeakers placed at a plurality of angles relative to the device under test, so as to test the wake-up effect on the device under test and obtain the wake-up results. If the wake-up results indicate that most of the false wake-up segments in the set can wake the device under test, the set may be determined to be a valid set; if only a few of them can wake the device under test, the set may be determined to be an invalid set.
In practice, the execution body may determine the specified number threshold in various ways. For example, it may look up, in a preset correspondence between number thresholds and numbers of false wake-up segments, the number threshold corresponding to the number of false wake-up segments included in the false wake-up segment set and use it as the specified number threshold. Alternatively, it may input the number of false wake-up segments included in the set into a preset specified-number-threshold determination model and take the output of that model as the specified number threshold.
As shown in fig. 4b, the device under test receives the false wake-up segment set played by loudspeakers (i.e., sound sources) at various angles, the plurality of loudspeakers being placed within 180 degrees of the device under test.
These implementations can accurately verify whether a false wake-up segment set is valid by playing the set from multiple angles to perform a wake-up test.
Optionally, the method may further include: raising the confidence threshold in the case that the false wake-up segment set is determined to be an invalid set.
Specifically, the execution body may raise the confidence threshold of the deep neural network model if it determines that the false wake-up segment set is an invalid set. The execution body can then use the deep neural network model with the raised confidence threshold to regenerate a more accurate false wake-up segment set.
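A sketch of the validity decision and the threshold adjustment described above; the rule deriving the specified number threshold from the set size (here 80 percent of the segments) and the 0.05 adjustment step are assumptions, not values from the patent.

```python
def evaluate_false_wake_set(wake_results, required_fraction: float = 0.8) -> bool:
    """wake_results: one boolean per false wake-up segment in the set, True if the
    device under test was actually woken when that segment was played from the
    loudspeakers placed at several angles around it."""
    specified_number_threshold = int(required_fraction * len(wake_results))
    return sum(wake_results) >= specified_number_threshold

# If the set turns out to be invalid, raise the model's confidence threshold and
# regenerate a more accurate false wake-up segment set, e.g.:
# if not evaluate_false_wake_set(wake_results):
#     confidence_threshold += 0.05
```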
With further reference to fig. 5, as an implementation of the method shown in the preceding figures, the present application provides an embodiment of a device for acquiring false wake-up audio. This device embodiment corresponds to the method embodiment shown in fig. 2 and, in addition to the features described below, may include the same or corresponding features and effects as that method embodiment. The device can be applied to various electronic devices.
As shown in fig. 5, the device 500 for acquiring false wake-up audio of the present embodiment includes an acquisition unit 501, a prediction unit 502 and a generating unit 503. The acquisition unit 501 is configured to collect played audio as audio to be processed; the prediction unit 502 is configured to input the audio to be processed into a preset deep neural network model to obtain the position, within the audio to be processed, of an approximate wake-up audio segment, wherein the deep neural network model is used for predicting the positions of audio segments that approximate the wake-up word audio in the input audio; and the generating unit 503 is configured to determine a false wake-up segment in the audio to be processed based on the position of the approximate wake-up audio segment, and to generate a false wake-up segment set including the false wake-up segment.
In this embodiment, for the specific processing and technical effects of the acquisition unit 501, the prediction unit 502 and the generating unit 503 of the device 500 for acquiring false wake-up audio, reference may be made to the descriptions of step 201, step 202 and step 203 in the embodiment corresponding to fig. 2, which are not repeated here.
In some alternative implementations of the present embodiment, the position includes a start point and an end point, and the generating unit is further configured to determine the false wake-up segment in the audio to be processed based on the position of the approximate wake-up audio segment in the following manner: from the start point of the approximate wake-up audio segment, determining the point in the audio to be processed that lies a first preset duration earlier as the target start point; from the end point of the approximate wake-up audio segment, determining the point in the audio to be processed that lies a second preset duration later as the target end point; and extracting the audio segment between the target start point and the target end point and determining the extracted audio segment as the false wake-up segment.
In some optional implementations of this embodiment, the prediction unit is further configured to obtain the position, within the audio to be processed, of the approximate wake-up audio segment in the following manner: obtaining the wake-up confidence of the approximate wake-up audio segment and its position in the audio to be processed as output by the deep neural network model, wherein the wake-up confidence of the approximate wake-up audio segment with respect to the wake-up word audio is higher than the confidence threshold of the deep neural network model.
In some optional implementations of the present embodiment, the generating unit is further configured to generate the false wake-up segment set including the false wake-up segment in the following manner: splicing the individual false wake-up segments to generate a false wake-up segment set including the splicing result. The device further includes: a lowering unit configured to lower the confidence threshold in response to the number of positions of the output approximate wake-up audio segments being less than a target number threshold, wherein the target number threshold is associated with the duration of the audio to be processed; and a splicing unit configured to re-input the audio to be processed into the deep neural network model with the lowered confidence threshold to obtain the positions of the approximate wake-up audio segments in the audio to be processed and their wake-up confidences, determine false wake-up segments based on those positions, splice the false wake-up segments, and update the false wake-up segment set with the splicing result, wherein the duration of the spliced false wake-up segments in the false wake-up segment set is longer than a target duration.
In some optional implementations of this embodiment, the device further includes: a result acquisition unit configured to obtain the wake-up results produced on the device under test when loudspeakers placed at a plurality of angles play the false wake-up segment set; a first determining unit configured to determine that the false wake-up segment set is a valid set if, in the wake-up results, the number of false wake-up segments that wake the device under test reaches a specified number threshold, wherein the specified number threshold is associated with the number of false wake-up segments included in the set; and a second determining unit configured to determine that the false wake-up segment set is an invalid set if the number of false wake-up segments that wake the device under test does not reach the specified number threshold.
In some optional implementations of this embodiment, the device further includes: a raising unit configured to raise the confidence threshold in the case that the false wake-up segment set is determined to be an invalid set.
According to embodiments of the present application, an electronic device and a readable storage medium are also provided.
Fig. 6 is a block diagram of an electronic device for the method for acquiring false wake-up audio according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the application described and/or claimed herein.
As shown in fig. 6, the electronic device includes: one or more processors 601, a memory 602, and interfaces for connecting the components, including high-speed interfaces and low-speed interfaces. The components are interconnected by different buses and may be mounted on a common motherboard or in other ways as desired. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device (such as a display device coupled to an interface). In other embodiments, multiple processors and/or multiple buses may be used together with multiple memories, if desired. Likewise, multiple electronic devices may be connected, each providing some of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 601 is taken as an example in fig. 6.
Memory 602 is a non-transitory computer-readable storage medium provided herein. The memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method for acquiring false wake-up audio provided by the application. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to execute the method for acquiring false wake-up audio provided by the present application.
The memory 602 is used as a non-transitory computer readable storage medium, and may be used to store a non-transitory software program, a non-transitory computer executable program, and modules, such as program instructions/modules (e.g., the acquisition unit 501, the prediction unit 502, and the generation unit 503 shown in fig. 5) corresponding to the method for acquiring false wake-up audio in the embodiments of the present application. The processor 601 executes various functional applications of the server and data processing by running non-transitory software programs, instructions, and modules stored in the memory 602, that is, implements the method for acquiring false wake-up audio in the above-described method embodiment.
The memory 602 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required by at least one function, and the data storage area may store data created according to the use of the electronic device for acquiring false wake-up audio, and the like. In addition, the memory 602 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 602 may optionally include memory remotely located with respect to the processor 601, which may be connected to the electronic device for acquiring false wake-up audio via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device for the method for acquiring false wake-up audio may further include an input device 603 and an output device 604. The processor 601, the memory 602, the input device 603 and the output device 604 may be connected by a bus or in other ways; connection by a bus is taken as an example in fig. 6.
The input device 603 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device for acquiring false wake-up audio, and may be, for example, a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a trackball or a joystick. The output device 604 may include a display device, auxiliary lighting means (e.g., LEDs), tactile feedback means (e.g., vibration motors), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light-emitting diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computing programs (also referred to as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present application may be implemented by software, or may be implemented by hardware. The described units may also be provided in a processor, for example, described as: a processor includes an acquisition unit, a prediction unit, and a generation unit. The names of these units do not constitute a limitation on the unit itself in some cases, and for example, the acquisition unit may also be described as "a unit that collects played audio as audio to be processed".
As another aspect, the present application also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments, or may exist separately without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: collect played audio as audio to be processed; input the audio to be processed into a preset deep neural network model to obtain the position, within the audio to be processed, of an approximate wake-up audio segment, wherein the deep neural network model is used for predicting the positions of audio segments that approximate the wake-up word audio in the input audio; and determine a false wake-up segment in the audio to be processed based on the position of the approximate wake-up audio segment, and generate a false wake-up segment set including the false wake-up segment.
The foregoing description covers only the preferred embodiments of the present application and explains the technical principles employed. It will be appreciated by persons skilled in the art that the scope of the invention referred to in this application is not limited to the specific combinations of the features described above, and is also intended to cover other technical solutions formed by any combination of the above features or their equivalents without departing from the inventive concept, for example, solutions in which the above features are replaced with (but not limited to) technical features with similar functions disclosed in the present application.

Claims (14)

1. A method for acquiring false wake-up audio, the method comprising:
collecting played audio as audio to be processed;
inputting the audio to be processed into a preset deep neural network model to obtain the position, within the audio to be processed, of an approximate wake-up audio segment, wherein the deep neural network model is used for predicting the positions of audio segments that approximate the wake-up word audio in the input audio;
determining a false wake-up segment in the audio to be processed based on the position of the approximate wake-up audio segment, and generating a false wake-up segment set comprising the false wake-up segment;
in response to the number of positions of the output approximate wake-up audio segments being less than a target number threshold, lowering a confidence threshold of the deep neural network model, wherein the target number threshold is associated with the duration of the audio to be processed;
and re-inputting the audio to be processed into the deep neural network model with the lowered confidence threshold to obtain the positions of the approximate wake-up audio segments in the audio to be processed and their wake-up confidences, determining false wake-up segments based on the positions of the approximate wake-up audio segments, splicing the false wake-up segments, and updating the false wake-up segment set with the splicing result, wherein the duration of the spliced false wake-up segments in the false wake-up segment set is longer than a target duration.
2. The method of claim 1, wherein the position comprises a start point and an end point;
the determining a false wake-up segment in the audio to be processed based on the position of the approximate wake-up audio segment comprises:
from the start point of the approximate wake-up audio segment, determining the point in the audio to be processed that lies a first preset duration earlier as a target start point;
from the end point of the approximate wake-up audio segment, determining the point in the audio to be processed that lies a second preset duration later as a target end point;
and extracting the audio segment between the target start point and the target end point, and determining the extracted audio segment as the false wake-up segment.
3. The method of claim 1 or 2, wherein the obtaining the position, within the audio to be processed, of an approximate wake-up audio segment comprises:
obtaining the wake-up confidence of the approximate wake-up audio segment and its position in the audio to be processed as output by the deep neural network model, wherein the wake-up confidence of the approximate wake-up audio segment with respect to the wake-up word audio is higher than a confidence threshold of the deep neural network model.
4. The method of claim 3, wherein the generating a false wake-up segment set comprising the false wake-up segment comprises:
splicing the individual false wake-up segments to generate a false wake-up segment set comprising the splicing result.
5. The method of claim 3, wherein the method further comprises:
obtaining the wake-up results produced on the device under test when loudspeakers placed at a plurality of angles play the false wake-up segment set;
if, in the wake-up results, the number of false wake-up segments that wake the device under test reaches a specified number threshold, determining that the false wake-up segment set is a valid set, wherein the specified number threshold is associated with the number of false wake-up segments comprised in the false wake-up segment set;
and if the number of false wake-up segments that wake the device under test does not reach the specified number threshold, determining that the false wake-up segment set is an invalid set.
6. The method of claim 5, wherein the method further comprises:
raising the confidence threshold if the false wake-up segment set is determined to be an invalid set.
7. A device for acquiring false wake-up audio, the device comprising:
an acquisition unit configured to collect played audio as audio to be processed;
a prediction unit configured to input the audio to be processed into a preset deep neural network model to obtain the position, within the audio to be processed, of an approximate wake-up audio segment, wherein the deep neural network model is used for predicting the positions of audio segments that approximate the wake-up word audio in the input audio;
a generating unit configured to determine a false wake-up segment in the audio to be processed based on the position of the approximate wake-up audio segment, and to generate a false wake-up segment set comprising the false wake-up segment;
a lowering unit configured to lower a confidence threshold of the deep neural network model in response to the number of positions of the output approximate wake-up audio segments being less than a target number threshold, wherein the target number threshold is associated with the duration of the audio to be processed;
and a splicing unit configured to re-input the audio to be processed into the deep neural network model with the lowered confidence threshold to obtain the positions of the approximate wake-up audio segments in the audio to be processed and their wake-up confidences, determine false wake-up segments based on the positions of the approximate wake-up audio segments, splice the false wake-up segments, and update the false wake-up segment set with the splicing result, wherein the duration of the spliced false wake-up segments in the false wake-up segment set is longer than a target duration.
8. The device of claim 7, wherein the position comprises a start point and an end point;
the generating unit is further configured to determine the false wake-up segment in the audio to be processed based on the position of the approximate wake-up audio segment in the following manner:
from the start point of the approximate wake-up audio segment, determining the point in the audio to be processed that lies a first preset duration earlier as a target start point;
from the end point of the approximate wake-up audio segment, determining the point in the audio to be processed that lies a second preset duration later as a target end point;
and extracting the audio segment between the target start point and the target end point, and determining the extracted audio segment as the false wake-up segment.
9. The device of claim 7 or 8, wherein the prediction unit is further configured to obtain the position, within the audio to be processed, of the approximate wake-up audio segment in the following manner:
obtaining the wake-up confidence of the approximate wake-up audio segment and its position in the audio to be processed as output by the deep neural network model, wherein the wake-up confidence of the approximate wake-up audio segment with respect to the wake-up word audio is higher than a confidence threshold of the deep neural network model.
10. The device of claim 9, wherein the generating unit is further configured to generate the false wake-up segment set comprising the false wake-up segment in the following manner:
splicing the individual false wake-up segments to generate a false wake-up segment set comprising the splicing result.
11. The device of claim 9, wherein the device further comprises:
a result acquisition unit configured to obtain the wake-up results produced on the device under test when loudspeakers placed at a plurality of angles play the false wake-up segment set;
a first determining unit configured to determine that the false wake-up segment set is a valid set if, in the wake-up results, the number of false wake-up segments that wake the device under test reaches a specified number threshold, wherein the specified number threshold is associated with the number of false wake-up segments comprised in the false wake-up segment set;
and a second determining unit configured to determine that the false wake-up segment set is an invalid set if the number of false wake-up segments that wake the device under test does not reach the specified number threshold.
12. The device of claim 11, wherein the device further comprises:
a raising unit configured to raise the confidence threshold if the false wake-up segment set is determined to be an invalid set.
13. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs,
which, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-6.
14. A computer readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method of any one of claims 1-6.
CN202010981082.8A 2020-09-17 2020-09-17 Acquisition method and device for false wake-up audio Active CN112114886B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010981082.8A CN112114886B (en) 2020-09-17 2020-09-17 Acquisition method and device for false wake-up audio

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010981082.8A CN112114886B (en) 2020-09-17 2020-09-17 Acquisition method and device for false wake-up audio

Publications (2)

Publication Number Publication Date
CN112114886A CN112114886A (en) 2020-12-22
CN112114886B (en) 2024-03-29

Family

ID=73799761

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010981082.8A Active CN112114886B (en) 2020-09-17 2020-09-17 Acquisition method and device for false wake-up audio

Country Status (1)

Country Link
CN (1) CN112114886B (en)


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102982804B (en) * 2011-09-02 2017-05-03 杜比实验室特许公司 Method and system of voice frequency classification
US10878811B2 (en) * 2018-09-14 2020-12-29 Sonos, Inc. Networked devices, systems, and methods for intelligently deactivating wake-word engines

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110097876A (en) * 2018-01-30 2019-08-06 阿里巴巴集团控股有限公司 Voice wakes up processing method and is waken up equipment
CN110517670A (en) * 2019-08-28 2019-11-29 苏州思必驰信息科技有限公司 Promote the method and apparatus for waking up performance
CN110415699A (en) * 2019-08-30 2019-11-05 北京声智科技有限公司 A kind of judgment method, device and electronic equipment that voice wakes up
CN111640426A (en) * 2020-06-10 2020-09-08 北京百度网讯科技有限公司 Method and apparatus for outputting information

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Implementation of a multi-keyword offline voice wake-up module; 边蓓蓓; 张晓贤; 现代信息科技 (Modern Information Technology) (08); full text *

Also Published As

Publication number Publication date
CN112114886A (en) 2020-12-22

Similar Documents

Publication Publication Date Title
US10249301B2 (en) Method and system for speech recognition processing
CN108683937B (en) Voice interaction feedback method and system for smart television and computer readable medium
CN106971009B (en) Voice database generation method and device, storage medium and electronic equipment
EP3902280A1 (en) Short video generation method and platform, electronic device, and storage medium
CN104866275B (en) Method and device for acquiring image information
CN112259072A (en) Voice conversion method and device and electronic equipment
TW200941167A (en) Automated recording of virtual device interface
JP2021034003A (en) Human object recognition method, apparatus, electronic device, storage medium, and program
CN112752121B (en) Video cover generation method and device
US11748389B1 (en) Delegated decision tree evaluation
JP7281521B2 (en) Voice control method and voice control device, electronic device and storage medium
CN111177453A (en) Method, device and equipment for controlling audio playing and computer readable storage medium
CN111225236A (en) Method and device for generating video cover, electronic equipment and computer-readable storage medium
CN113672748A (en) Multimedia information playing method and device
CN112652304B (en) Voice interaction method and device of intelligent equipment and electronic equipment
CN111158924A (en) Content sharing method and device, electronic equipment and readable storage medium
CN111949820B (en) Video associated interest point processing method and device and electronic equipment
CN110477866B (en) Method and device for detecting sleep quality, electronic equipment and storage medium
CN112382292A (en) Voice-based control method and device
KR20210040330A (en) Video clip extraction method and device
CN113190695A (en) Multimedia data searching method and device, computer equipment and medium
CN112114886B (en) Acquisition method and device for false wake-up audio
CN110633357A (en) Voice interaction method, device, equipment and medium
WO2023109103A1 (en) Video editing method and apparatus, electronic device, and medium
US10558697B2 (en) Segmenting a set of media data using a set of social networking data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant