The Future of Large Scale Multi-modal and Multi-task Learning

Gaurav Sharma · Head of AI Research

June 18th, 2024 · 6 min read

Not too many moons ago, taking pictures and having them printed was a big task. We would borrow a Nikon film camera from our uncle and go buy a film roll. The most expensive roll would hold 36 pictures, and the first one or two would get spoiled while loading the film into the camera. After finishing the roll, we would take it to the photo printing shop, which would have the prints ready for us in a day or two. Video cameras were a rare luxury and very hard to get our hands on, and only the expensive, bulky cassette players could record audio to magnetic tape.

Fast forward to now: everyone has a phone with a camera in their pocket, and we take pictures and record audio and video generously, not just to save memories but also for basic communication. Creating and consuming all these modalities has become a way of life.

A similar evolution has happened for text. We have sophisticated software on our phones to typeset beautiful text documents which we can print on printers at home. 

3D is also about to become very accessible. Many modern phones come with 3D sensors, and the software that converts videos of objects into 3D representations is maturing quickly and becoming a commodity.

The creation of all the major modalities -- text, image, video and 3D -- has been truly democratized. We all now have easy access to tools for working with these modalities and we are creating enormous amounts of data in all these modalities. 

A similar story is playing out for even the most advanced AI-based tools available for these modalities. For example, just a few years ago it took a four-person team one month, and hardware worth tens of thousands of dollars, to build a system that could analyze reviews on an ecommerce webpage. Now anyone can open one of the many available LLM services in a web browser on a mobile phone and ask the LLM to summarize the sentiment in the reviews. The rapid advancement in AI tools has given superpowers to everyone with basic access to computing devices and the internet.

While powerful, many such AI tools still specialize in a limited number of tasks on one modality. Some do text very well, e.g. LLMs, and others work great for audio/speech, e.g. Alexa or Siri. Such single-modality systems are available today and deliver good results.

However, as humans, we do not work with modalities and tasks in isolation. We learn to associate them in complex ways and transfer our learning from one modality to another. For example, we can associate certain sounds with certain appearances, e.g. a specific type of chirping with a specific bird, and then infer the kind of sound a similar-looking bird might make without ever having heard that new type of bird. This is an oversimplification, and many of you might object that it is not true in general. I would agree, but the point is that such correlations exist to varying degrees across tasks and modalities, and humans leverage them to learn faster and more robustly. The next wave of AI systems will do the same.

Recent versions of popular LLMs are now going multi-modal, e.g. they can now take images, and some even videos, as inputs, describe them and answer questions about their content. They are now learning over two modalities -- images and text! 

This week at CVPR we will spotlight more of these insights in our paper, where we scale multi-modal learning to 12 different modalities, including image, video, audio, text, X-ray, infrared and depth maps. We also perform multiple tasks across the different modalities, such as classification and pixelwise segmentation. For this research, we combined 25 public benchmark datasets into one large multi-modal benchmark, with the goal of testing the proposed method on large amounts of data.

The method we propose is quite intuitive: it uses a mechanism that allows the modalities to communicate with each other while learning the tasks. It is based on the very successful Transformer architecture, which is now the de facto standard for many AI tasks and is also used in popular LLMs. Unsurprisingly, our experiments show that learning multiple modalities and multiple tasks together gives a boost in performance across the board, along with excellent generalization.
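To make the idea of modalities "communicating" a little more concrete, here is a minimal sketch of cross-attention, the standard Transformer building block that lets tokens from one input attend to tokens from another. It is only a generic illustration, not the mechanism from the paper: the function names (`cross_attend`, `exchange`), the random toy features and the omission of learned projection weights are all assumptions made for brevity.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries, keys_values):
    """Each query token gathers an attention-weighted average of the
    other tokens and adds it to itself (a residual connection).
    Assumption: no learned query/key/value projections, to keep it short."""
    dim = len(queries[0])
    updated = []
    for q in queries:
        weights = softmax([dot(q, k) / math.sqrt(dim) for k in keys_values])
        mixed = [
            sum(w * kv[i] for w, kv in zip(weights, keys_values))
            for i in range(dim)
        ]
        updated.append([qi + mi for qi, mi in zip(q, mixed)])
    return updated


def exchange(modalities):
    """Let every modality attend to the tokens of all other modalities.
    `modalities` maps a (hypothetical) modality name to its token vectors."""
    result = {}
    for name, tokens in modalities.items():
        others = [
            tok
            for other_name, other_tokens in modalities.items()
            if other_name != name
            for tok in other_tokens
        ]
        result[name] = cross_attend(tokens, others)
    return result


def random_tokens(count, dim=4):
    """Toy stand-in for features produced by a per-modality encoder."""
    return [[random.gauss(0, 1) for _ in range(dim)] for _ in range(count)]


random.seed(0)
features = {
    "image": random_tokens(3),
    "audio": random_tokens(2),
    "text": random_tokens(5),
}
fused = exchange(features)
for name, tokens in fused.items():
    print(name, len(tokens), len(tokens[0]))
```

Each modality keeps its own number of tokens and feature size, but every token now carries information gathered from the other modalities. A real system would add learned projections, multiple attention heads and task-specific output heads on top of this.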

We are of course not the only ones to propose multi-modal learning. As I mentioned before, a lot of LLMs are now becoming multi-modal, incorporating at least the visual modality and in some cases audio as well. Many major academic and industrial research labs are actively working on multi-modal and multi-task learning. Apart from pure performance benefits, it also helps with cost: instead of running a different AI model for each modality, we would run just one model, which would be faster and consume less power, even though this model might be larger than any of the single-modality models individually.

Eventually, creating and working with multi-modal content will be democratized as more and more AI models turn multi-modal, gain scale and popularity, and become easy to use. In the very near future, what once took a group of people with specialized hardware and software tools to create will probably be possible for each of us with a little bit of training and easily accessible, affordable hardware. At Typeface, we are investing heavily in making that future possible by actively researching multi-modal and multi-task learning, and we are very excited about what we can enable our customers to create, be it text, images, videos or soon even 3D. So please stay tuned, give us a try and let us know what worked and what else you would like to see included.

