04/24/2026
The SAM family has kept refining how users interact with it. The original SAM used points and boxes for image segmentation. SAM2 extended this to video.
SAM3 introduced Promptable Concept Segmentation (PCS), locating all instances in an image that match a given noun phrase. But for longer, more complex natural-language instructions, SAM3 has to route through an external LLM to translate them into noun phrases first. That makes the system heavier, and fine-grained meaning can get lost along the way.
A multi-institution research team recently proposed SAM3-I (Segment Anything with Instructions), defining a new task, Promptable Instruction Segmentation (PIS). It gives the SAM family a direct path for handling complex natural-language instructions, without routing through an LLM middle layer.
Paper: https://arxiv.org/abs/2512.04585
Open-sourced on: https://github.com/debby-0527/SAM3-I
SAM3-I organizes instructions by difficulty into Concept / Simple / Complex levels. On SAM3's text side, it inserts an Instruction-Aware Cascaded Adapter that learns progressively across these levels. The S-Adapter focuses on explicit conditions like attributes and location. The C-Adapter builds on that to handle functional descriptions and implicit reasoning. They mirror how humans move from catching keywords to deeper comprehension.
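To make the cascade idea concrete, here is a minimal sketch in plain Python. It is not the SAM3-I implementation: the residual bottleneck design, the zero-initialised up-projection, the feature size, and every name (`BottleneckAdapter`, `CascadedAdapter`, `level`) are assumptions for illustration. It only shows the routing described above: Concept instructions keep the original text features, Simple instructions pass through the S-Adapter, and Complex instructions pass through the S-Adapter and then the C-Adapter.

```python
import random

LEVELS = ("concept", "simple", "complex")


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class BottleneckAdapter:
    """Hypothetical residual adapter: x + up(relu(down(x)))."""

    def __init__(self, dim, hidden, rng):
        self.down = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(hidden)]
        # Zero-initialised up-projection (an assumption): the adapter starts
        # as an identity mapping, so the base model's behaviour is unchanged.
        self.up = [[0.0] * hidden for _ in range(dim)]

    def __call__(self, x):
        h = [max(0.0, v) for v in matvec(self.down, x)]
        delta = matvec(self.up, h)
        return [a + b for a, b in zip(x, delta)]


class CascadedAdapter:
    """Routes text features through S- and C-adapters by instruction level."""

    def __init__(self, dim=8, hidden=2, seed=0):
        rng = random.Random(seed)
        self.s_adapter = BottleneckAdapter(dim, hidden, rng)
        self.c_adapter = BottleneckAdapter(dim, hidden, rng)

    def __call__(self, text_features, level):
        if level not in LEVELS:
            raise ValueError(f"unknown instruction level: {level!r}")
        if level == "concept":
            # Concept prompts (noun phrases) keep the original features.
            return list(text_features)
        out = self.s_adapter(text_features)
        if level == "complex":
            # The C-adapter builds on the S-adapter's output.
            out = self.c_adapter(out)
        return out
```

In this sketch, a call like `CascadedAdapter()([1.0] * 8, "complex")` returns the input unchanged until the up-projections are trained, which is one common way to add adapters without disturbing a pretrained model.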
The team also designs four complementary distribution-alignment losses, aiming for the same object to be understood the same way, whether the instruction is a short description or a longer reasoning chain.
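The post does not spell out the four losses, so the snippet below is only an illustration of what a distribution-alignment term can look like, not one of the paper's losses. It assumes the model produces a vector of scores (hypothetical `short_logits` and `long_logits`) for the same object under a short and a long instruction, and penalises disagreement with a symmetric KL divergence.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def symmetric_alignment_loss(short_logits, long_logits):
    """Symmetric KL between the distributions induced by two phrasings.

    Illustrative stand-in only: zero when both phrasings yield the same
    distribution, positive otherwise.
    """
    p = softmax(short_logits)
    q = softmax(long_logits)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
```

The loss is zero when a short description and a longer reasoning chain produce the same distribution, and it grows as the two diverge, which is the behaviour the alignment objective is after.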
To support this, they build HMPL-Instruct, a dataset of 840k instructions spanning concept to reasoning, object to part, and single-instance to multi-instance.
On simple instructions, SAM3-I outperforms the SAM3 Agent baseline by 31.3 absolute points in gIoU. On complex instructions, the margin is 22.6 points. It uses 1/8 of the parameters and requires only a single forward pass.
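For readers unfamiliar with the metric, the sketch below assumes gIoU follows the convention common in reasoning-segmentation benchmarks: the mean of per-image IoU scores. The post itself does not define it. Masks are plain nested lists of 0/1 values, the function names are made up for this example, and scoring an empty prediction against an empty target as 1.0 is also an assumption.

```python
def mask_iou(pred, target):
    """IoU between two binary masks given as nested lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumption: an empty prediction for an empty target counts as a match.
    return 1.0 if union == 0 else inter / union


def giou(preds, targets):
    """Mean of per-image IoU over a set of samples (0-to-1 scale)."""
    if not preds or len(preds) != len(targets):
        raise ValueError("preds and targets must be non-empty and equal length")
    scores = [mask_iou(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)
```

On this 0-to-1 scale, a gap of 31.3 absolute points is a difference of 0.313 between two models' scores, not a 31.3% relative improvement.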
The work shows that segmentation can acquire complex language understanding through parameter-efficient adaptation, without giving up existing capabilities. With larger instruction data and more dialog-style interaction, general-purpose segmentation that follows real human instructions is starting to look practical.