
Today, Microsoft's Azure AI team dropped a new vision foundation model called Florence-2 on Hugging Face.

Available under a permissive MIT license, the model can handle a variety of vision and vision-language tasks using a unified, prompt-based representation. It comes in two sizes, 232M and 771M parameters, and already excels at tasks like captioning, object detection, visual grounding and segmentation, performing on par with or better than many large vision models available.

While the model's real-world performance remains to be tested, the work is expected to give enterprises a single, unified approach to handling different types of vision applications. This could save them from investing in separate task-specific vision models that fail to go beyond their basic function without extensive fine-tuning.

What makes Florence-2 unique?

Today, large language models (LLMs) sit at the heart of enterprise operations. A single model can draft briefs, write marketing copy and, in many cases, even handle customer service. The level of adaptability across domains and tasks has been amazing. But this success has also left researchers wondering: could vision models, which have largely been task-specific, do the same?


In essence, vision tasks are more complex than text-based natural language processing (NLP): they require comprehensive perceptual ability. To achieve a universal representation of diverse vision tasks, a model must be able to understand spatial data across different scales, from broad image-level concepts such as object location down to fine pixel details, as well as semantic granularity, from high-level captions to detailed descriptions.

When Microsoft tried to solve this problem, it found two key obstacles: the scarcity of comprehensively annotated visual datasets, and the absence of a unified pre-training framework with a single network architecture that integrates an understanding of both spatial hierarchy and semantic granularity.

To address this, the company first used specialized models to create a visual dataset called FLD-5B. It included a total of 5.4 billion annotations for 126 million images, covering details from high-level descriptions down to specific regions and objects. Then, using this data, it trained Florence-2, which uses a sequence-to-sequence architecture (a type of neural network designed for tasks involving sequential data) combining an image encoder with a multi-modality encoder-decoder. This enables the model to handle different vision tasks without requiring task-specific architectural modifications.

“All annotations in the dataset, FLD-5B, are uniformly standardized into textual outputs, facilitating a unified multi-task learning approach with consistent optimization with the same loss function as the objective,” the researchers wrote in the paper detailing the model. “The result is a versatile vision foundation model capable of performing a variety of tasks…all within a single model governed by a unified set of parameters. Task activation is achieved through textual prompts, mirroring the approach used by large language models.”
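This unification hinges on expressing every annotation, including spatial ones, as plain text the decoder can emit. A minimal sketch of the idea, assuming a Florence-2-style scheme in which box coordinates are quantized into 1,000 bins and written as location tokens (the helper name and exact token format here are illustrative, not Microsoft's code):

```python
# Illustrative sketch of how region annotations can be "textified":
# each coordinate of a bounding box is quantized into a fixed number of
# bins relative to the image extent, then written as a <loc_N> token so
# the annotation becomes an ordinary text sequence.

def box_to_location_tokens(box, image_size, bins=1000):
    """Quantize an (x1, y1, x2, y2) pixel box into <loc_N> text tokens."""
    w, h = image_size
    tokens = []
    for value, extent in zip(box, (w, h, w, h)):
        bin_index = min(int(value / extent * bins), bins - 1)
        tokens.append(f"<loc_{bin_index}>")
    return "".join(tokens)

# A detection annotation then becomes plain text a seq2seq decoder can emit:
annotation = "car" + box_to_location_tokens((64, 32, 512, 480), (640, 480))
print(annotation)  # car<loc_100><loc_66><loc_800><loc_999>
```

Because every task's output lives in the same token space, one decoder, one vocabulary and one loss function can cover captioning, detection and grounding alike.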

Better performance than larger models

Given image and text inputs, Florence-2 handles a variety of tasks, including object detection, captioning, visual grounding and visual question answering. Most importantly, it delivers this at a quality on par with or better than many larger models.

For example, in zero-shot captioning on the COCO dataset, both the 232M and 771M versions of Florence-2 outperformed DeepMind's 80B-parameter Flamingo visual language model with scores of 133 and 135.6, respectively. They also performed better than Microsoft's own Kosmos-2 visual grounding model.

When Florence-2 was fine-tuned with publicly available human-annotated data, it was able to compete closely with several larger specialist models across tasks such as visual question answering, despite its compact size.

“The pre-trained Florence-2 backbone enhances performance on downstream tasks, such as COCO object detection and instance segmentation, and ADE20K semantic segmentation, outperforming both supervised and self-supervised models,” the researchers noted. “Compared to models pre-trained on ImageNet, our model improves training efficiency by 4X and achieves substantial improvements of 6.9, 5.5, and 5.9 points on the COCO and ADE20K datasets.”

As of now, both pre-trained and fine-tuned versions of Florence-2, at 232M and 771M parameters, are available on Hugging Face under a permissive MIT license that allows unrestricted distribution and modification for commercial or private use.
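Based on the usage pattern published on the Hugging Face model cards, inference looks roughly like the sketch below. It assumes the `microsoft/Florence-2-base` checkpoint and that the checkpoint's bundled remote code loads in your environment; the blank test image stands in for a real photo:

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# trust_remote_code is required because the Florence-2 architecture ships
# with the checkpoint rather than inside the transformers library itself.
model_id = "microsoft/Florence-2-base"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Placeholder image; substitute any RGB image of your own.
image = Image.new("RGB", (640, 480), color="white")

# Tasks are selected purely by text prompt, e.g. "<CAPTION>" or "<OD>".
task = "<OD>"  # object detection
inputs = processor(text=task, images=image, return_tensors="pt")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    num_beams=3,
)
raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
# Post-processing turns the raw token string back into structured output,
# keyed by the task prompt.
result = processor.post_process_generation(raw, task=task, image_size=image.size)
print(result)
```

Switching tasks means changing only the `task` string; the weights, architecture and decoding loop stay the same.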

It will be interesting to see how developers use it to eliminate the need for separate vision models for different tasks. Small, task-agnostic models can not only save developers from juggling multiple models but also cut compute costs by a large margin.
