Multimodal Adaptive Recognition System & Sound2Sound: A Music Information Retrieval Application

Minsook Moon




Introduction
Among the many technologies that have emerged in the 21st century, the mutual exchange of information between humans and computers is one of the most actively discussed topics. In music information retrieval in particular, direct communication between human and computer can be highly effective when people are looking for musical information. In this case, a computer system acts as an intermediary by analyzing a human voice and connecting it to pre-recorded human voices; to serve as that intermediary, the system must match the features of both voices.
Whether written or recorded, the amount of music information grows every year, and music information needs have grown along with it. As a result, many technologies have been developed to help people retrieve the music they want from this enormous body of material. Because music information differs from existing text-based information, retrieving sonic information calls for different approaches; in other words, people can use sound itself, such as singing, humming, or whistling, when they look for music information.
The Multimodal Adaptive Recognition System (MARS) builds on this concept. The technology makes it possible to match a human voice to other human voices by comparing multiple features, without any manual indexing. Midomi.com is a music search engine that uses MARS: it extracts various features from the user's voice and matches them against existing voice recordings to find the requested information. Sound2Sound technology contributes to music information retrieval by extracting “sound features” from the database, and it is used for music information retrieval on smartphones in particular.
By enabling this kind of matching, these systems let us talk to the computer and get music information back from it, which makes them a promising direction for future information retrieval. The two technologies are discussed below.

Background of technologies
MARS was developed by SoundHound, Inc., a sound search company that delivers music search and discovery solutions. The company was formerly known as Melodis Corporation and changed its name to SoundHound, Inc. in May 2010. It was incorporated in 2005 and is based in San Jose, California.
The company describes MARS Search as the world's most scalable voice-activated music search technology because it extracts a variety of features from the tune, including speech, pitch, tempo, and even the location of pauses. With these next-generation search tools, MARS Search is presented as the first search engine that can match a human voice to another human voice by analyzing these multiple features.
The company now offers Sound2Sound technology, which searches sound against sound, bypassing traditional sound-to-text conversion techniques. It powers music and voice search applications on various platforms, including smartphones and other devices.

MARS
What is it?
As stated above, MARS stands for Multimodal Adaptive Recognition System, and it belongs to the area of human-computer interaction. MARS extracts a number of features from the audio signal, including pitch variation, rhythm, pause locations, and speech and phonetic content. It then adapts to the query by estimating which features are more important than others; in other words, MARS extracts elements from the audio to compare with other audio files when a user is searching for a match (Colvin, 2009). For example, if the query is in the form of humming, speech content is ignored; if the user sings the lyrics as well, the search takes speech and phonetic content into account.
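To make this extraction and adaptation step concrete, the sketch below shows one way such query features might be computed. It is only an illustration: the function names, thresholds, and weights are assumptions, and the open-source librosa library stands in for whatever proprietary signal processing MARS actually uses.

import numpy as np
import librosa

def extract_query_features(path):
    # Load the sung, hummed, or whistled query.
    y, sr = librosa.load(path, sr=None, mono=True)

    # Pitch variation: fundamental-frequency contour of the query.
    f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                 fmax=librosa.note_to_hz("C6"), sr=sr)
    pitch_contour = f0[voiced]                      # keep voiced frames only

    # Rhythm information: estimated tempo and beat positions.
    tempo, beats = librosa.beat.beat_track(y=y, sr=sr)

    # Location of pauses: boundaries of the non-silent intervals.
    voiced_spans = librosa.effects.split(y, top_db=30)

    return {"pitch": pitch_contour, "tempo": tempo,
            "beats": beats, "pauses": voiced_spans}

def adapt_weights(query_has_lyrics):
    # Adaptive step: a hummed or whistled query ignores the phonetic/speech
    # channel, while a sung query lets it contribute to the match.
    if query_has_lyrics:
        return {"pitch": 0.4, "rhythm": 0.3, "phonetic": 0.3}
    return {"pitch": 0.6, "rhythm": 0.4, "phonetic": 0.0}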

Features
With MARS, users can still search for songs using traditional text-based terms such as artist, song, or album. They can also search for that musical information by singing, humming, or whistling into a microphone; when the user's singing matches one of the existing songs recorded by other users, the system returns the song the user was looking for.
The technology identifies pitch variation, rhythm information, the location of pauses, phonetic content, and speech content in the query and matches these against songs in the Midomi database.
More weight is also given to stronger components. For example, phonetic and speech content weighs more heavily in the search for songs sung with lyrics than for songs that are hummed or whistled. MARS searches independently of key, tempo, language, and singing quality, which means that you do not have to be a great singer for the search to work.
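This key- and tempo-independence can be pictured with another small sketch: if the query's pitch contour is expressed relative to its own median (removing the key) and resampled to a fixed length (removing the tempo) before comparison, a poor singer in the wrong key and at the wrong speed can still match the right song. The distance measure and database layout below are illustrative assumptions, not SoundHound's actual algorithm; the weights could come from a function such as adapt_weights above.

import numpy as np

def normalize_contour(f0_hz, length=128):
    semitones = 12 * np.log2(np.asarray(f0_hz, dtype=float))
    semitones -= np.median(semitones)                 # key-invariant
    x_old = np.linspace(0, 1, len(semitones))
    x_new = np.linspace(0, 1, length)                 # tempo-invariant
    return np.interp(x_new, x_old, semitones)

def rank_candidates(query_f0, database, weights):
    # database maps song_id -> {"f0": reference contour, "phonetic_distance": float}
    q = normalize_contour(query_f0)
    scores = []
    for song_id, entry in database.items():
        d_pitch = np.mean(np.abs(q - normalize_contour(entry["f0"])))
        d_phon = entry.get("phonetic_distance", 0.0)  # 0 if the query was hummed
        score = weights["pitch"] * d_pitch + weights["phonetic"] * d_phon
        scores.append((score, song_id))
    return [song_id for _, song_id in sorted(scores)]  # best match first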
According to Melodis, MARS has 95% accuracy, which should allow it to deliver results that feel more human-like than machine-like in the future.

Application
Currently there is no direct competition for MARS Search, and there are many possible business applications, since the technology is well suited to mobile phones, online music stores, car and home audio systems, and any other device consumers may use to enjoy music. MARS Search has also been applied to text-based search, where it provides an advanced adaptive system that works around potential user errors such as misspelling and mistyping.
Midomi. Midomi is one application that uses MARS to locate music information with the human voice. It is a music search tool that lets people retrieve music information by singing, humming, or whistling, and it connects them with a community that shares their musical interests (“Midomi,” n.d.). Midomi was launched in January 2007 by Melodis Corporation, which developed the technology to help people identify a tune that has been stuck in their head all day but that they cannot put into words. Midomi is offered in 10 languages so it can be used in different countries.
The mission of Midomi is to build the most comprehensive database of searchable music. Midomi also has a user-created content (UCC) aspect: users can contribute to the database by singing, humming, or whistling in Midomi's online recording studio, in any language or genre (“Midomi,” n.d.).

Sound2Sound
What is it?
Sound2Sound (S2S) technology is broadly similar to MARS in terms of HCI when users look for music information. According to the SoundHound website, S2S search is capable of recognizing various sound inputs, including music and speech, and delivers both speed and accuracy. SoundHound has therefore used this technology on its phone applications and continues to develop new technology on top of it.

Features
S2S performs recognition by extracting features from the input signal and converting them into a compact and flexible Crystal representation. This input Crystal is then matched against a database of Target Crystals that have been derived from the searchable content.
The figure below illustrates how S2S recognizes and matches music information.



[Figure: the Sound2Sound recognition and matching process (retrieved from SoundHound.com)]
As the figure shows, music information can be input to the software in the form of sound, voice, or typed words. The software then extracts sound features from the input and matches them against the target database. Each database entry is represented as a Target Crystal, and these Target Crystals are generated automatically from a range of data formats, including audio (recorded music or users' voices) and non-audio data.
In the S2S model, “sound features” are very important, because the ability to extract the right features from the input directly affects the search result. Determining the sound features stored in the database is therefore equally important for this system.
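Because the Crystal format itself is proprietary, the sketch below only illustrates the general idea described above: the input is reduced to a compact set of sound features, and those features are looked up in an index built offline from the searchable content (the Target Crystals). The hashing scheme and data structures are assumptions made purely for illustration.

from collections import defaultdict

def make_crystal(feature_frames, precision=3):
    # Reduce a sequence of feature vectors to a compact set of quantized keys.
    return {tuple(round(v, precision) for v in frame) for frame in feature_frames}

class TargetIndex:
    # A toy stand-in for the database of Target Crystals.

    def __init__(self):
        self.index = defaultdict(set)          # feature key -> set of song ids

    def add_target(self, song_id, feature_frames):
        # Built offline from recorded music, users' voices, or non-audio data.
        for key in make_crystal(feature_frames):
            self.index[key].add(song_id)

    def search(self, query_frames):
        # Count how many sound features each target shares with the query.
        votes = defaultdict(int)
        for key in make_crystal(query_frames):
            for song_id in self.index[key]:
                votes[song_id] += 1
        return sorted(votes, key=votes.get, reverse=True)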

Application
The S2S technology is aimed especially at smartphones; it powers music and voice search on iPhone, iPad, Android, Windows Mobile, and Symbian devices.

Discussion
As described above, these two technologies were developed to improve music information retrieval by using sound information, such as the human voice or recorded music, as the search term. Both MARS and S2S address several difficulties in sound recognition, including the following:
Speech recognition. Both technologies help music information retrieval systems match the user's voice to audio or text information in the database. Voice input is especially useful in a specific domain such as music.
Music identification. MARS and S2S also enable the system to return fast and accurate results from the database. S2S in particular can identify music even when there is a little noise, an improvement over MARS.
Singing & humming search. Both technologies also allow users to sing, hum, or whistle when they look for music information, and the system recognizes the lyrics when the user sings them.
Text search. Even though the two technologies offer many innovative features, users can still perform text searches. This capability supports the audio recognition and reduces errors.
In short, both MARS and S2S were developed to help users find music information more accurately. They build on HCI, which has been studied for a long time, but applying it to music information retrieval is quite innovative and has already been embraced by many users. Since this technology is deeply tied to our daily lives, even more accurate and advanced technologies can be expected in the near future.


References
Colvin, J. (2009). Naming that tune: Mobile music information retrieval systems. Music Reference Services Quarterly, 12(1/2), 29-32.
Midomi.com. (n.d.). Retrieved from http://www.midomi.com/
Niklfeld, G., Finan, R., & Pucher, M. (2001). Architecture for adaptive multimodal dialog systems based on VoiceXML. Retrieved from http://www.tsi.enst.fr/~chollet/Biblio/Congres/Audio/Eurospeech01/CDROM/papers/page2341.pdf.
SoundHound. (n.d.). Retrieved from http://www.soundhound.com