Microsoft Research has introduced Magma, a new AI model that could mark a significant advance in controlling both software interfaces and robotic systems. Magma combines visual and language processing, which lets it operate in both digital and physical environments and makes it a potentially versatile model.
Unlike many existing multimodal AI systems, which rely on separate models to interpret data and to perform actions, Magma integrates both capabilities into a single system. Microsoft claims this sets Magma apart: it can process text, images, and video and act on them natively, whether it is navigating software or controlling a robot. This could lead to more autonomous AI systems that are capable of operating across a wide range of scenarios.
Magma was developed by Microsoft in collaboration with prominent academic institutions, including KAIST, the University of Maryland, the University of Wisconsin-Madison, and the University of Washington. Microsoft envisions the model as a step toward agentic AI, moving beyond simply answering questions or executing single commands. An agentic system could autonomously plan and perform multistep tasks to achieve complex goals without human intervention.
In its research, Microsoft highlights how Magma can craft a plan from a described goal and then take actions to fulfill that goal. By drawing on the visual and language data available to it, Magma can handle intricate tasks in both virtual and physical settings, which could have applications in industries such as manufacturing, healthcare, and digital automation.
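To make the plan-then-act idea concrete, here is a minimal toy sketch of an agent loop. It is not Magma's code or API. Every name in it (`ToyEnvironment`, `make_plan`, `run_agent`) is hypothetical, and the "planner" simply splits a goal string into steps where a real model would reason over images and text.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEnvironment:
    """Hypothetical stand-in for a software UI or a robot workspace."""
    completed: list = field(default_factory=list)

    def observe(self) -> dict:
        # A real agent would receive pixels and text; this returns a state summary.
        return {"completed": list(self.completed)}

    def apply(self, action: str) -> None:
        # Record the action as done; a real environment would click, type, or move.
        self.completed.append(action)


def make_plan(goal: str) -> list:
    """Hypothetical planner: split a goal written as 'a, then b' into ordered steps."""
    return [step.strip() for step in goal.split(", then ") if step.strip()]


def run_agent(goal: str, env: ToyEnvironment, max_steps: int = 10) -> list:
    """Plan once, then repeat observe -> choose next step -> act until done."""
    plan = make_plan(goal)
    for _ in range(max_steps):
        observation = env.observe()
        remaining = [s for s in plan if s not in observation["completed"]]
        if not remaining:
            break  # every planned step has been carried out
        env.apply(remaining[0])
    return env.completed


if __name__ == "__main__":
    goal = "open settings, then enable dark mode, then close settings"
    print(run_agent(goal, ToyEnvironment()))
```

The loop checks the environment before each action instead of replaying the plan blindly, which is the part of the agentic pattern the article describes: the system keeps perceiving while it acts.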
Other tech companies, including OpenAI and Google, are also exploring agentic AI. OpenAI’s Operator focuses on performing tasks in web browsers, while Google has been developing agentic capabilities with Gemini 2.0. What sets Magma apart is its integrated approach to perception and action, which could give it an edge in real-world applications.