The discussion centers on the use of multiple class (CLS) tokens in Vision Transformers (ViTs) and their effect on model performance. One commenter describes experimenting with multiple global tokens in a chess neural network, which did not outperform a single-token baseline, casting doubt on whether extra CLS tokens reliably improve results in ViT-style models. Links to earlier threads on the same idea indicate that multiple CLS tokens are not a new proposal, and commenters ask whether the approach has seen broader adoption or acceptance in the field since those prior conversations.
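To make concrete what "multiple global tokens" means in this context, below is a minimal PyTorch sketch of the general idea: several learnable tokens are prepended to the patch sequence, attend to everything through the encoder, and are then pooled for the classification head. The module name, dimensions, and mean-pooling choice are illustrative assumptions, not drawn from the discussion; setting `num_global_tokens=1` recovers the standard single-CLS-token setup.

```python
import torch
import torch.nn as nn

class MultiTokenViTHead(nn.Module):
    """Hypothetical sketch: prepend several learnable global (CLS-like)
    tokens to a patch sequence before a transformer encoder."""

    def __init__(self, embed_dim=384, num_global_tokens=4,
                 num_classes=1000, depth=6, num_heads=6):
        super().__init__()
        # One learnable vector per global token, shared across the batch.
        self.global_tokens = nn.Parameter(
            torch.zeros(1, num_global_tokens, embed_dim))
        nn.init.trunc_normal_(self.global_tokens, std=0.02)

        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, patch_embeddings):
        # patch_embeddings: (batch, num_patches, embed_dim)
        b = patch_embeddings.size(0)
        tokens = self.global_tokens.expand(b, -1, -1)
        x = torch.cat([tokens, patch_embeddings], dim=1)
        x = self.encoder(x)
        # Pool only the global tokens (mean here) and classify from them.
        k = self.global_tokens.size(1)
        pooled = x[:, :k].mean(dim=1)
        return self.head(pooled)

# Usage sketch with dummy patch embeddings:
# model = MultiTokenViTHead()
# logits = model(torch.randn(2, 196, 384))  # (2, 1000)
```

How the extra tokens are aggregated (mean, concatenation, or a dedicated token per task) is a design choice the discussion does not settle, and, per the commenter's chess experiment, adding tokens alone is not guaranteed to beat the single-token baseline.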