#10 The world of AI is developing extremely fast this week
This week in our bi-weekly innovation substack: ElevenLabs dazzles with its ability to breathe life into silent videos through AI-generated sound effects, while Google DeepMind's Gemini 1.5 shatters expectations with a context window large enough to delve deeper into complex problem-solving than ever before. Meanwhile, Boximator introduces a revolutionary way to command motion in AI-generated videos with unmatched precision, and Groq's innovative processor units turbocharge AI model responses, ushering in an era of nearly instantaneous digital interactions. Let’s go!
ElevenLabs developed an AI to generate sound effects
OpenAI's Sora impresses with its capabilities, yet lacks audio; ElevenLabs' latest innovation aims to fix this with new background sound technology.
ElevenLabs is boldly suggesting we move beyond traditional sound design into a new era where enhanced audio for video clips is just a prompt away.
On X, ElevenLabs recently showcased a demonstration through a brief video featuring AI-generated sound, using footage from OpenAI's Sora—the same clips that captivated many of us. Check it out below:
Caption by ElevenLabs: We were blown away by the Sora announcement but felt it needed something... What if you could describe a sound and generate it with AI?
Google DeepMind's Gemini 1.5 is really good at problem solving
Just a few weeks ago, Google introduced its GPT-4 rival, dubbed Gemini 1.0 Ultra. According to benchmarks, it's poised to match or even surpass the performance of GPT-4. However, Google hasn't rested; it swiftly followed up with the launch of Gemini 1.5 Pro last week. This update's key enhancement is the expansion of the context window from 128,000 tokens to an impressive 1 million tokens. The context window is the volume of content that can fit into a single prompt. As a rough guide, 100 characters equate to about 27 tokens, so 1 million tokens can accommodate roughly 3.7 million characters, a significant amount of text.
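The conversion above can be sanity-checked with a few lines of Python. Note that the 27-tokens-per-100-characters ratio is this newsletter's own rough estimate, not an official tokenizer statistic:

```python
# Rough back-of-the-envelope conversion between tokens and characters,
# using the newsletter's estimate of ~27 tokens per 100 characters.
TOKENS_PER_100_CHARS = 27

def tokens_to_chars(tokens: int) -> int:
    """Approximate character capacity for a given token budget."""
    return round(tokens * 100 / TOKENS_PER_100_CHARS)

# Compare the context windows mentioned in this section.
for window in (128_000, 200_000, 1_000_000):
    print(f"{window:>9,} tokens ≈ {tokens_to_chars(window):>9,} characters")
```

Real tokenizers vary by model and language, so treat these figures as order-of-magnitude estimates only.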
So, how does a 1 million token context window compare to major rivals like GPT-4 and Claude 2? As it stands, the competition isn't in the same league. While 100,000 (GPT-4) and 200,000 (Claude 2) tokens already represent a substantial volume of text, they pale in comparison to the 1 million tokens offered by Gemini 1.5 Pro. This substantial increase in the context window gives Gemini its first significant edge over the competition. The ability to process such a vast amount of text opens up exciting possibilities, as illustrated by the video examples provided below.
The first example shows Gemini problem-solving across 100,633 lines of code.
Or this example of Gemini locating a specific scene in an uploaded film.
Motion control for AI generated video by Boximator
The launch of Sora overshadowed Boximator, an AI model deserving of more recognition. Boximator enhances user control over video generation models. Currently, most video generation tools operate based on textual prompts to create videos. While this text-based control is innovative, it offers limited influence over the specific movements within a video. Boximator addresses this issue by granting users more precise control over the movements in AI-generated videos.
A notable yet understated feature of Boximator is its compatibility with both existing and forthcoming video generation models. It is designed to integrate seamlessly with other video diffusion models, such as Sora. This compatibility makes Boximator especially compelling as models of Sora's calibre become more accessible.
Instant Content Creation Unleashed with Groq (not Elon's Grok)
So, what exactly is Groq? Despite the similar name, it's certainly not Grok, the Twitter/X chatbot known for its colorful language. Instead, Groq stands apart as a pioneer in the technology sector, specializing in the creation of processor units called LPUs. These chips are tailor-made to power Generative AI (Gen-AI) models, offering a unique proposition. What distinguishes LPUs is their remarkable speed advantage over the traditional GPUs currently employed by Gen-AI models like ChatGPT and Gemini. This means users can receive responses from Groq's technology almost instantaneously, eliminating the seconds or minutes typically associated with ChatGPT outputs.
But how does Groq compare in performance to well-known models? Let's look at the numbers:
GPT-3.5 processes about 166 tokens per second.
GPT-4 processes around 110 tokens per second.
Groq, when used in conjunction with the Mixtral 8x7B-32K model, processes approximately 500 tokens per second.
Claude 2 processes about 64 tokens per second.
At roughly 500 tokens per second, Groq's processing speed is about three times that of GPT-3.5 (and more than four times that of GPT-4) while utilizing the better-performing Mixtral 8x7B-32K model. This efficiency positions Groq as an enticing alternative for applications where speed is paramount. Moreover, it's four times less expensive than GPT-3.5 and over a hundred times cheaper than GPT-4. These cost and performance efficiencies make Groq an ideal solution for speech applications or real-time personalization in web development.
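For a quick sense of the relative speedups, the throughput figures quoted above can be compared with a short Python sketch. The numbers are this newsletter's quoted benchmarks, not measurements of our own:

```python
# Tokens-per-second figures as quoted in this section.
throughput = {
    "Groq + Mixtral 8x7B-32K": 500,
    "GPT-3.5": 166,
    "GPT-4": 110,
    "Claude 2": 64,
}

# Express every competitor's speed relative to Groq's.
groq = throughput["Groq + Mixtral 8x7B-32K"]
for model, tps in throughput.items():
    if model.startswith("Groq"):
        continue
    print(f"Groq is ~{groq / tps:.1f}x faster than {model}")
```

Tokens per second depend heavily on prompt length, batching, and load, so ratios like these shift over time; they are useful for rough comparison only.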
Why our jaws dropped when we saw OpenAI's Sora: a gigantic step forward for text-to-video generation models
In the evolving landscape of generative artificial intelligence, OpenAI's introduction of Sora, a sophisticated text-to-video model, marks a significant milestone. This innovative technology extends beyond the realms of static imagery and simple animations, offering the ability to create detailed, minute-long videos from textual prompts. Sora's development underscores a remarkable leap towards understanding and simulating the complexities of the physical world in motion, showcasing the potential to revolutionize content creation across various sectors.
https://www.livewall.co/blogs/openai-sora-text-to-video-revolution