📖 Education

Huazhong University of Science and Technology — Wuhan, China
Sep. 2019 - Jun. 2026, Ph.D. in Control Science and Engineering, School of Artificial Intelligence and Automation, advised by Prof. Yunfeng Luo.
Research areas: Large Language Models, Reinforcement Learning, Game Theory.
📝 Publications
- Preference-CFR Beyond Nash Equilibrium for Better Game Strategies (ICML 2025). Proposes the Preference Counterfactual Regret Minimization (Pref-CFR) algorithm to achieve diverse Nash equilibria, enabling customizable strategies by incorporating preference and vulnerability parameters. Demonstrates distinct play styles in Texas Hold’em without sacrificing strategic strength.
- Accelerating Nash Equilibrium Convergence in Monte Carlo Settings Through Counterfactual Value Based Fictitious Play (NeurIPS 2024). Introduces the Monte Carlo Counterfactual Value-Based Fictitious Play (MCCFVFP) algorithm for large-scale games, achieving 20–50% faster convergence than standard MCCFR in complex settings like Texas Hold’em.
- Real-Time Weighted Fictitious Play: Converging to Equilibrium at the Speed of $O(T^{-1})$ in Games. Presents the Real-Time Weighted Fictitious Play (RTWFP) algorithm with $O(T^{-1})$ convergence in two-player zero-sum games, extending to correlated equilibrium and continuous-time FP. Outperforms existing algorithms in scalability and speed.
- ELO-Rated Sequence Rewards: Advancing Reinforcement Learning Models. Proposes ELO-Rated Sequence Rewards (ERRL), which uses ordinal preferences and ELO ratings to replace numerical rewards, achieving superior performance in long-term RL tasks like Atari benchmarks.
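For the ERRL line of work, a minimal sketch of the Elo-style update that turns ordinal (pairwise) preferences into ratings; the function and constants below are a textbook Elo formulation for illustration, not the paper's implementation:

```python
def elo_update(rating_a, rating_b, a_preferred, k=32.0):
    """Update Elo-style ratings after an ordinal comparison of two trajectories.

    rating_a, rating_b : current ratings of trajectories A and B
    a_preferred        : 1.0 if A is preferred, 0.0 if B is preferred, 0.5 for a tie
    k                  : step size controlling how fast ratings move
    """
    # Expected "win probability" of A under the logistic Elo model
    expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
    new_a = rating_a + k * (a_preferred - expected_a)
    new_b = rating_b + k * ((1.0 - a_preferred) - (1.0 - expected_a))
    return new_a, new_b

# Example: the trajectory rated 1200 is preferred over the one rated 1000
print(elo_update(1200.0, 1000.0, a_preferred=1.0))
```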
🧑‍💼 Work Experience
ByteDance Seed — Doubao Post-Training (Jul. 2026 ~ Now)
Working on post-training of the Doubao large language models at ByteDance Seed, focused on alignment and capability enhancement through supervised fine-tuning, RLHF/RLAIF, and reasoning-oriented reward modeling, with the goal of further improving Doubao’s instruction following, reasoning, and tool use.
💻 Internship Experience

ByteDance Seed — LLM Post-Training Based on Games (Jun. 2025 ~ Oct. 2025)
Individual Contributor.
- Project Goal: integrate game-theoretic problems into LLM post-training to enhance LLM capabilities.
- Project Results: delivered game-oriented LLM evaluation and training frameworks; trained an LLM Texas Hold'em AI that outperformed OpenAI o3, Grok-4, and other models, with improved instruction following.
- Personal Work: proposed a novel algorithm with an "LLM Reflection" mechanism that outperforms traditional RL methods in game scenarios; built the game-oriented LLM evaluation framework (integrated into the team's system) and a training framework based on Verl.
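For illustration, a hypothetical sketch of a turn-based evaluation loop for an LLM game agent; `query_llm`, the `env` interface, and the prompt format are placeholders assumed for this example, not the team's actual framework:

```python
import json

def play_episode(env, query_llm, max_turns=60):
    """Evaluate an LLM agent in a turn-based game.

    env       : game environment with reset(), legal_actions(), step(action), done, payoff
    query_llm : callable that takes a prompt string and returns the model's text reply
    """
    state = env.reset()
    for _ in range(max_turns):
        prompt = (
            f"Current game state:\n{state}\n"
            f"Legal actions: {env.legal_actions()}\n"
            'Reply with JSON: {"action": <one legal action>}.'
        )
        reply = query_llm(prompt)
        try:
            action = json.loads(reply)["action"]
            if action not in env.legal_actions():
                raise ValueError(action)
        except (json.JSONDecodeError, KeyError, ValueError):
            action = env.legal_actions()[0]  # fall back to a default legal action
        state = env.step(action)
        if env.done:
            break
    return env.payoff  # final score used for head-to-head comparisons
```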

vivo — SD Model Fine-Tuning via Reinforcement Learning (Feb. 2025 ~ Apr. 2025)
- Project Goal: fine-tune Stable Diffusion with reinforcement learning to improve generation quality and prompt alignment.
- Project Results: early results show clear improvements in image quality, textual relevance and human preference alignment; ongoing work on reward design and distributed training scale-up.
- Personal Work: designed a composite reward model combining aesthetics, textual relevance, diversity and human feedback; tuned the RL training pipeline for SD.
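A minimal sketch of how a composite reward of this kind could be formed as a weighted sum; the component scorers and weights below are placeholders for illustration, not the reward model actually used in the project:

```python
def composite_reward(image, prompt, scorers, weights):
    """Combine several per-image scores into one scalar reward for RL fine-tuning.

    scorers : dict mapping component name -> callable(image, prompt) -> score in [0, 1]
    weights : dict mapping component name -> weight
    """
    return sum(weights[name] * score_fn(image, prompt) for name, score_fn in scorers.items())

# Hypothetical usage with placeholder scorers standing in for an aesthetic predictor,
# a CLIP-style text-image similarity, a diversity bonus, and a human-preference model.
scorers = {
    "aesthetics":       lambda img, p: 0.8,
    "text_relevance":   lambda img, p: 0.6,
    "diversity":        lambda img, p: 0.5,
    "human_preference": lambda img, p: 0.7,
}
weights = {"aesthetics": 0.3, "text_relevance": 0.4, "diversity": 0.1, "human_preference": 0.2}
print(composite_reward(None, "a cat wearing a hat", scorers, weights))
```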

Fen AI Lab — Texas Hold’em AI (Sep. 2023 ~ Jan. 2024)
Project member, team of four.
- Project Goal: create an AI that matches the performance of Pluribus, a renowned Texas Hold’em AI.
- Project Results: the final AI reached the level of professional players in two-player Texas Hold’em; the multi-player version is still under development.
- Personal Work: contributed to key algorithms (MCCFR and MCCFR with pruning), built foundational components such as strategy storage and result visualization, and handled algorithm parameter tuning and testing.
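For context, a minimal sketch of regret matching, the strategy-update rule at the core of CFR/MCCFR (a standard textbook formulation, not the project's code):

```python
import numpy as np

def regret_matching(cumulative_regret):
    """Turn cumulative counterfactual regrets into a strategy (probability vector).

    Actions with positive regret are played in proportion to that regret;
    if no action has positive regret, fall back to the uniform strategy.
    """
    positive = np.maximum(cumulative_regret, 0.0)
    total = positive.sum()
    if total > 0.0:
        return positive / total
    return np.full_like(cumulative_regret, 1.0 / len(cumulative_regret))

# Example: regrets for (fold, call, raise) at one information set
print(regret_matching(np.array([-2.0, 3.0, 1.0])))  # -> [0.  , 0.75, 0.25]
```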

ByteDance Nuverse — Reinforcement Learning Internship (Jul. 2021 ~ Mar. 2022)
Main implementer, team of two.
- Project Goal: design multi-style AI companion NPCs for the game One Piece: Burning Blood.
- Project Results: added a style-evolution module on top of the existing AI training framework, improving key metrics by 80–120% and producing clearly differentiated play styles; several AIs reached deployable quality.
- Personal Work: served as the main implementer of the project; under my advisor's guidance, implemented the multi-style AI algorithm and explored integrating human preferences into reinforcement learning, culminating in a research paper summarizing the findings and potential applications.

China Resources Group — Land Auction (Feb. 2021 ~ Jun. 2021)
Project leader, team of six.
- Project Goal: design a bidding strategy for China Resources Group in the “first/last” land auction.
- Project Results: the strategy was approved by China Resources Land Group and deployed in dozens of land auctions (each over $100M). The algorithm outperformed the group’s expert approach, boosting bid accuracy 3–4× and winning probability by ~5%, and was adopted as their standard land auction strategy.
- Personal Work: built a simulator of the “first/last” land auction from historical data and applied the Fictitious Play algorithm to develop strategies. Participated in three real auctions involving a total bidding scale of $1B, and refined the model based on real-world outcomes.
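A minimal sketch of fictitious play in a symmetric bidding game of the kind described above; the bid grid and first-price payoff are illustrative stand-ins, not the deployed auction model:

```python
import random
from collections import Counter

def fictitious_play(bid_grid, payoff, n_opponents=3, iterations=500, samples=50):
    """Fictitious play in a symmetric bidding game: each round, best-respond to the
    empirical distribution of past best responses, then add that response to history.

    bid_grid : list of candidate bids
    payoff   : callable(my_bid, opponent_bids) -> float
    """
    history = Counter(bid_grid)  # uniform prior: one observation per candidate bid
    for _ in range(iterations):
        bids, counts = zip(*history.items())

        def expected_payoff(my_bid):
            # Monte Carlo estimate against opponents drawn from the empirical mix
            return sum(
                payoff(my_bid, random.choices(bids, weights=counts, k=n_opponents))
                for _ in range(samples)
            ) / samples

        history[max(bid_grid, key=expected_payoff)] += 1
    total = sum(history.values())
    return {bid: count / total for bid, count in sorted(history.items())}  # average strategy

# Illustrative first-price auction with common value 100: highest bid wins, ties split.
def payoff(my_bid, opponent_bids):
    top = max(opponent_bids)
    if my_bid > top:
        return 100 - my_bid
    if my_bid == top:
        return (100 - my_bid) / (opponent_bids.count(top) + 1)
    return 0.0

print(fictitious_play(bid_grid=list(range(0, 101, 10)), payoff=payoff))
```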