Game Audio Retrieval Technology Based on Deep Learning

Preface#

Large model technology is developing rapidly these days. Beyond the well-known LLMs, multimodal models have also earned a solid place in the industry. One of the better-known ones, CLIP, was built to give machines the ability to relate information across the text-image domain. With CLIP, a model can "understand" the relationship between visual content and text, which is what makes image generation, video generation, and other multimodal models possible.

In the audio field, a text-audio CLAP model has also been developed based on the principles of the CLIP model.

Model Introduction#

CLAP is a self-supervised model trained with contrastive learning.

CLAP maps both text and audio into the same vector space through a text encoder and an audio encoder, and then finds matching text-audio pairs by comparing the similarity of their vectors. The text encoder is BERT, while the audio encoder works on Mel spectrograms computed from the audio.

[Figure: CLAP architecture — text and audio encoders mapping into a shared embedding space]
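
To make the training objective concrete, here is a minimal sketch of CLAP-style contrastive learning. It is not the official implementation: the real model uses a BERT text encoder and an audio encoder fed with Mel spectrograms, whereas the two encoders below are stand-in linear projections over dummy feature vectors, just to show how matched text-audio pairs are pulled together in the shared space.

```python
# Minimal sketch of a CLAP-style contrastive objective (placeholder encoders, dummy features).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyCLAP(nn.Module):
    def __init__(self, text_dim=768, audio_dim=512, embed_dim=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, embed_dim)    # stands in for BERT + projection
        self.audio_proj = nn.Linear(audio_dim, embed_dim)  # stands in for the Mel-spectrogram encoder
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # learnable temperature (exp ≈ 14.3)

    def forward(self, text_feats, audio_feats):
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        a = F.normalize(self.audio_proj(audio_feats), dim=-1)
        # Cosine similarity matrix: row i compares audio i against every text in the batch.
        return self.logit_scale.exp() * a @ t.T

def contrastive_loss(logits):
    # Matched text-audio pairs sit on the diagonal; train in both directions.
    labels = torch.arange(logits.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

# Dummy batch of 8 text/audio feature vectors.
model = ToyCLAP()
loss = contrastive_loss(model(torch.randn(8, 768), torch.randn(8, 512)))
loss.backward()
```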

In terms of model performance, the model was evaluated on several classic audio event datasets, covering mainly audio classification, music classification, sentiment analysis, and speaker counting. The test results are as follows:

[Figure: evaluation results on the benchmark datasets]

Here, ZS denotes the zero-shot setting, and even in that case the performance is already quite good.
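
"Zero-shot" here means the class labels are turned into short prompt sentences, embedded with the text encoder, and the class whose prompt embedding is most similar to the audio embedding is chosen, with no task-specific training. The sketch below assumes a hypothetical embed_text wrapper that returns L2-normalized vectors, and the prompt template is only illustrative.

```python
# Sketch of zero-shot (ZS) classification with a CLAP-style model.
# embed_text is a hypothetical wrapper around the text encoder; both it and
# audio_embedding are assumed to return/be L2-normalized vectors.
import numpy as np

def zero_shot_classify(audio_embedding, class_names, embed_text):
    prompts = [f"this is the sound of {name}" for name in class_names]
    text_embeddings = np.stack([embed_text(p) for p in prompts])   # shape (C, D)
    sims = text_embeddings @ audio_embedding                        # cosine similarities
    return class_names[int(np.argmax(sims))]                        # best-matching class
```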

Application in Game Audio#

Initially, the CLAP model was used mainly for audio classification, such as detecting specific alarm sounds in surveillance footage.

However, after observing the model's characteristics, the author realized that CLAP could be suitable for searching sound effect libraries.

In traditional sound effect library management, sound effects can only be searched by file name or meta tags. This approach relies heavily on the library having a consistent naming convention and complete meta tags so that designers can find the sounds they want.

In addition, some sound effects are composites layered from several sub-sounds, and the audio content of those sub-sounds may not match their file names, which forces designers to memorize even more sound effect names for similar scenarios.

Finally, many sound effect names are in English, and we have to admit that not every designer has strong enough English to search effectively.

The introduction of the CLAP model neatly sidesteps these issues. First, because of the text encoder it uses, multiple languages are mapped into the same vector space, so multilingual search comes essentially for free.

Second, by training on large amounts of text paired with the corresponding Mel spectrograms, CLAP can be regarded as genuinely understanding the content a sound effect expresses, rather than relying on its file name. This also lets us search for audio by describing the desired sound in natural language.
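
A minimal sketch of what such natural-language search can look like: embed every file in the library once, cache the vectors, and rank them against the embedded query by cosine similarity. embed_audio_file and embed_text are hypothetical wrappers around the (fine-tuned) CLAP encoders, and the paths and queries are only illustrative.

```python
# Sketch of natural-language search over a sound effect library.
# embed_audio_file / embed_text are hypothetical wrappers around the CLAP encoders,
# assumed to return L2-normalized numpy vectors of the same dimension.
import numpy as np
from pathlib import Path

def build_index(library_dir, embed_audio_file):
    """Embed every wav in the library once and cache the vectors."""
    paths = sorted(str(p) for p in Path(library_dir).rglob("*.wav"))
    vecs = np.stack([embed_audio_file(p) for p in paths])   # shape (N, D)
    return paths, vecs

def search(query, paths, vecs, embed_text, top_k=5):
    """Rank the library against a text query by cosine similarity."""
    q = embed_text(query)                                    # shape (D,)
    scores = vecs @ q
    best = np.argsort(-scores)[:top_k]
    return [(paths[i], float(scores[i])) for i in best]

# With a multilingual text encoder, an English query and its Chinese equivalent land
# close together in the embedding space, so both retrieve similar results, e.g.:
#   search("a heavy wooden door creaking open", paths, vecs, embed_text)
#   search("沉重的木门缓缓打开的吱呀声", paths, vecs, embed_text)
```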

Finally, building on CLAP's "understanding" of audio content, we can search for audio not only with language but also with audio itself. Going further, we can feed in a reference sound effect and search the library separately against its low-, mid-, and high-frequency content, which makes it quick to put together sound effects that match a reference. XD
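
One possible way to realize the query-by-audio and band-split idea (an assumption, not the author's implementation): filter the reference sound into low, mid, and high bands, embed each band, and rank the library index from the previous sketch against each band's embedding. embed_audio_array is a hypothetical wrapper that maps a waveform to an L2-normalized CLAP embedding, and the band edges are illustrative.

```python
# Sketch of query-by-audio with a low/mid/high band split.
# Reuses the (paths, vecs) index from the previous sketch; embed_audio_array is a
# hypothetical wrapper that maps a waveform to an L2-normalized CLAP embedding.
import numpy as np
import soundfile as sf
from scipy.signal import butter, sosfilt

BANDS = {"low": (20, 250), "mid": (250, 2000), "high": (2000, 16000)}  # Hz, illustrative

def split_bands(path):
    """Filter the reference sound into three frequency bands."""
    y, sr = sf.read(path)
    if y.ndim > 1:
        y = y.mean(axis=1)                                   # mix down to mono
    bands = {}
    for name, (lo, hi) in BANDS.items():
        sos = butter(4, [lo, min(hi, sr / 2 - 1)], btype="bandpass", fs=sr, output="sos")
        bands[name] = (sosfilt(sos, y), sr)
    return bands

def search_by_reference(path, paths, vecs, embed_audio_array, top_k=5):
    """For each band of the reference sound, return the closest library files."""
    results = {}
    for name, (band, sr) in split_bands(path).items():
        q = embed_audio_array(band, sr)                      # shape (D,)
        best = np.argsort(-(vecs @ q))[:top_k]
        results[name] = [paths[i] for i in best]
    return results
```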

Model Fine-tuning#

Since the original model was trained on general audio event datasets, the author found that the pre-trained CLAP model retrieved game sound effects very poorly.

Therefore, the author believes that fine-tuning it for the game audio field is essential.

The author prepared several BoomLibrary sound effect libraries as the training set and set aside some others as a test set.

By examining CLAP's pre-training data, the author found that the text-audio pairs used for fine-tuning work best when the text is a descriptive sentence rather than a bare class label; this yields better fine-tuning results.
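
As a purely generic illustration of that point (not the author's confidential pipeline), one could turn library tags into short descriptive captions and write out an (audio_path, caption) manifest like this; the tag-to-caption templates and paths are made up.

```python
# Generic sketch: build descriptive (audio_path, caption) pairs from library tags.
# The tag-to-caption mapping and directory layout are entirely hypothetical.
import csv
from pathlib import Path

TAG_CAPTIONS = {
    "whoosh": "a fast whoosh passing by",
    "impact_metal": "a heavy metallic impact with a short ring-out",
    "footstep_grass": "a single footstep on dry grass",
}

def tags_from_filename(path):
    # e.g. "impact_metal_03.wav" -> ["impact_metal"]
    stem = path.stem.lower()
    return [tag for tag in TAG_CAPTIONS if tag in stem]

def build_manifest(library_dir, out_csv):
    rows = []
    for wav in Path(library_dir).rglob("*.wav"):
        for tag in tags_from_filename(wav):
            rows.append({"audio_path": str(wav),
                         "caption": f"the sound of {TAG_CAPTIONS[tag]}"})
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["audio_path", "caption"])
        writer.writeheader()
        writer.writerows(rows)
```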

Due to confidentiality reasons, the more specific data preparation and fine-tuning processes will not be included here. Interested readers can contact me privately for discussion.

Final Effect#

Below is a simple test run with a command-line interactive script. Please ignore the cases where a search query fails to find a wav file. 😀

The next step is putting this into practice: the author plans to build a sound effect library management and retrieval tool on top of it. For details, see the upcoming article "SoundLibraryPro—AI Sound Effect Library Management Software Introduction."
