The post presents a benchmark that tests the social skills of large language models (LLMs) through an elimination-game format. Commenters note its relevance to digital assistants and role-playing applications. Several critique the setup, arguing that reasoning models gain an advantage because they can maintain private plans alongside their public text outputs. Suggested improvements include giving every model the ability to formulate hidden plans and explicitly asking for planning, a technique that has proven effective in training for social tasks. Others point out how short the individual rounds are and question whether they are long enough for human-like coordination to emerge, or express interest in the code being made available for further experimentation. There is also curiosity about how humans would perform compared to LLMs in this setting.
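
As a rough illustration of the commenters' suggestion, the sketch below shows one way such a setup might look: every player, reasoning model or not, returns a private plan alongside its public message, and only the public part is broadcast to the other players. The prompt wording, the `take_turn` helper, and the JSON field names are hypothetical and not taken from the benchmark itself.

```python
import json
from typing import Callable

# Hypothetical turn handler: each player is asked to return a JSON object with a
# hidden "private_plan" and a visible "public_message". Only the public part is
# shared with the other players; the private part stays in that player's own
# context for later rounds.

TURN_PROMPT = """You are player {name} in an elimination game.
Game state visible to everyone:
{public_state}

Your private notes from earlier rounds:
{private_notes}

Reply with JSON: {{"private_plan": "...", "public_message": "..."}}
"""

def take_turn(name: str,
              public_state: str,
              private_notes: str,
              call_model: Callable[[str], str]) -> tuple[str, str]:
    """Query one player's model and split its output into hidden and public parts."""
    raw = call_model(TURN_PROMPT.format(name=name,
                                        public_state=public_state,
                                        private_notes=private_notes))
    reply = json.loads(raw)
    return reply["private_plan"], reply["public_message"]

if __name__ == "__main__":
    # Stub model so the sketch runs without any API key.
    def fake_model(prompt: str) -> str:
        return json.dumps({"private_plan": "Ally with B, vote out C next round.",
                           "public_message": "I think we should all keep an open mind."})

    plan, message = take_turn("A", "Round 1, no eliminations yet.", "", fake_model)
    print("kept private:", plan)
    print("broadcast   :", message)
```

Requiring this structured output from every model, not just the reasoning ones, is the commenters' point: it levels the playing field while also making each player's stated intentions available for later analysis.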