📖 Education

Huazhong University of Science and Technology — Wuhan, China

Sep. 2019 - Jun. 2026, Ph.D. in Control Science and Engineering, School of Artificial Intelligence and Automation, advised by Prof. Yunfeng Luo.

Research areas: Large Language Models, Reinforcement Learning, Game Theory.

📝 Publications

  1. Preference-CFR Beyond Nash Equilibrium for Better Game Strategies. (ICML 2025) Proposes the Preference Counterfactual Regret Minimization (Pref-CFR) algorithm to achieve diverse Nash equilibria, enabling customizable strategies by incorporating preference and vulnerability parameters. Demonstrates distinct play styles in Texas Hold’em without sacrificing strategic strength.
  2. Accelerating Nash Equilibrium Convergence in Monte Carlo Settings Through Counterfactual Value Based Fictitious Play (NeurIPS 2024). Introduces the Monte Carlo Counterfactual Value-Based Fictitious Play (MCCFVFP) algorithm for large-scale games, achieving 20–50% faster convergence than standard MCCFR in complex settings like Texas Hold’em.
  3. Real-Time Weighted Fictitious Play: Converging to Equilibrium at the Speed of $O(T^{-1})$ in Games. Presents the Real-Time Weighted Fictitious Play (RTWFP) algorithm with $O(T^{-1})$ convergence in two-player zero-sum games, extending to correlated equilibrium and continuous-time FP. Outperforms existing algorithms in scalability and speed.
  4. ELO-Rated Sequence Rewards: Advancing Reinforcement Learning Models. Proposes ELO-Rated Sequence Rewards (ERRL), which uses ordinal preferences and ELO ratings to replace numerical rewards, achieving superior performance in long-term RL tasks like Atari benchmarks.
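The CFR-family papers above build on regret matching, which maps cumulative regrets to a mixed strategy. A minimal, illustrative self-play sketch on rock–paper–scissors (not code from any of the papers; in two-player zero-sum games the average strategies approach the Nash equilibrium):

```python
import numpy as np

def regret_matching(cum_regret):
    """Turn cumulative regrets into a mixed strategy (regret matching)."""
    positive = np.maximum(cum_regret, 0.0)
    total = positive.sum()
    if total > 0.0:
        return positive / total
    return np.full(len(cum_regret), 1.0 / len(cum_regret))  # no positive regret: play uniformly

# Rock-paper-scissors payoff matrix for the row player.
PAYOFF = np.array([[0.0, -1.0, 1.0],
                   [1.0, 0.0, -1.0],
                   [-1.0, 1.0, 0.0]])

def self_play(iterations=200000):
    """Both players run regret matching against each other; in a
    zero-sum game the time-averaged strategies approach equilibrium."""
    regrets = [np.array([0.01, 0.0, 0.0]), np.zeros(3)]  # slightly asymmetric start
    sums = [np.zeros(3), np.zeros(3)]
    for _ in range(iterations):
        s0, s1 = regret_matching(regrets[0]), regret_matching(regrets[1])
        sums[0] += s0
        sums[1] += s1
        v0 = PAYOFF @ s1            # row player's expected action values
        v1 = -(PAYOFF.T @ s0)       # column player's (zero-sum)
        regrets[0] += v0 - s0 @ v0  # accumulate instantaneous regret
        regrets[1] += v1 - s1 @ v1
    return [s / iterations for s in sums]
```

For rock–paper–scissors the average strategies converge toward the uniform equilibrium (1/3, 1/3, 1/3); CFR applies the same update at every information set of an extensive-form game, and MCCFR samples the tree instead of traversing it fully.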

🧑‍💼 Work Experience

ByteDance Seed — Doubao Post-Training (Jul. 2026 ~ Now)

Working on the post-training of Doubao large language models at ByteDance Seed, focusing on alignment and capability enhancement through supervised fine-tuning, RLHF / RLAIF, and reasoning-oriented reward modeling, with the goal of further improving Doubao’s instruction-following, reasoning, and tool-use abilities.

💻 Internship Experience

ByteDance Seed — LLM Post-training Based on Games (Jun. 2025 ~ Oct. 2025)

Individual Contributor.

  • Project Goal: integrate game-theoretic problems into LLM post-training to enhance LLM capabilities.
  • Project Results: delivered game-oriented LLM evaluation and training frameworks; trained an LLM Texas Hold'em agent that outperformed models such as OpenAI o3 and Grok-4, with improved instruction following.
  • Personal Work: proposed a novel algorithm with an "LLM Reflection" mechanism that outperforms traditional RL methods in game scenarios; built the game-oriented LLM evaluation framework (integrated into the team's system) and a training framework based on verl.

vivo — SD Model Fine-Tuning via Reinforcement Learning (Feb. 2025 ~ Apr. 2025)

  • Project Goal: fine-tune Stable Diffusion with reinforcement learning to improve generation quality and prompt alignment.
  • Project Results: achieved clear improvements in image quality, textual relevance, and human-preference alignment in early experiments; reward design and distributed-training scale-up were still in progress when the internship ended.
  • Personal Work: designed a composite reward model combining aesthetics, textual relevance, diversity and human feedback; tuned the RL training pipeline for SD.
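A composite reward of this kind is typically a weighted combination of per-criterion scorers. A hypothetical sketch (the component names, weights, and constant scorers below are illustrative placeholders, not the actual reward model):

```python
from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass
class RewardTerm:
    name: str
    weight: float
    score: Callable[[Any, str], float]  # (image, prompt) -> score in [0, 1]

def composite_reward(terms: List[RewardTerm], image: Any, prompt: str) -> float:
    """Weighted sum of normalized per-criterion scores."""
    return sum(term.weight * term.score(image, prompt) for term in terms)

# Hypothetical components: in practice each scorer would be a learned
# model (an aesthetic predictor, a CLIP-style text-image similarity, a
# diversity measure over a batch, a human-feedback reward model, ...).
terms = [
    RewardTerm("aesthetics", 0.4, lambda img, p: 0.8),
    RewardTerm("text_relevance", 0.4, lambda img, p: 0.6),
    RewardTerm("diversity", 0.2, lambda img, p: 0.5),
]
```

Keeping each term normalized to a common range before weighting makes the trade-off between criteria explicit and easy to retune during RL training.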

Fen AI Lab — Texas Hold’em AI (Sep. 2023 ~ Jan. 2024)

Project member, team of four.

  • Project Goal: create an AI that matches the performance of Pluribus, a renowned Texas Hold’em AI.
  • Project Results: the final AI reached the level of professional players in two-player Texas Hold’em; the multi-player version is still under development.
  • Personal Work: contributed to key algorithms (MCCFR, MCCFR pruning), built foundational components such as strategy storage and result visualization, and handled algorithm parameter tuning and testing.

ByteDance Nuverse — Reinforcement Learning Internship (Jul. 2021 ~ Mar. 2022)

Main implementer, team of two.

  • Project Goal: design multi-style AI companion NPCs for the game One Piece: Burning Blood.
  • Project Results: added a style evolution module on top of the previous AI training framework, leading to an 80–120% improvement in key indicators and clear style differentiation in play; several AIs reached deployable quality.
  • Personal Work: served as the main executor of the project. Under my advisor's guidance, I implemented the multi-style AI algorithm and explored the integration of human preferences into reinforcement learning, which culminated in a research paper summarizing the findings and potential applications.

China Resources Group — Land Auction (Feb. 2021 ~ Jun. 2021)

Project leader, team of six.

  • Project Goal: design a bidding strategy for China Resources Group in the “first/last” land auction.
  • Project Results: the strategy was approved by China Resources Land Group and deployed in dozens of land auctions (each over $100M). The algorithm outperformed the group’s expert approach, boosting bid accuracy 3–4× and winning probability by ~5%, and was adopted as their standard land auction strategy.
  • Personal Work: built a simulator of the “first/last” land auction from historical data and applied the Fictitious Play algorithm to develop strategies. Participated in three real auctions involving a total bidding scale of $1B, and refined the model based on real-world outcomes.
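Classical fictitious play, as applied in the auction simulator, has each player best-respond to the empirical frequency of the opponent's past actions. A minimal sketch on a small zero-sum matrix game (the matching-pennies payoff here is illustrative, not the auction model):

```python
import numpy as np

def fictitious_play(payoff, iterations=20000):
    """Two-player zero-sum fictitious play: each round, every player
    best-responds to the empirical mix of the opponent's past actions."""
    n_rows, n_cols = payoff.shape
    row_counts, col_counts = np.zeros(n_rows), np.zeros(n_cols)
    row_counts[0] += 1.0  # arbitrary initial pure actions
    col_counts[0] += 1.0
    for _ in range(iterations):
        # Row player maximizes against the column player's empirical mix;
        # column player minimizes the row payoff (zero-sum).
        row_br = int(np.argmax(payoff @ (col_counts / col_counts.sum())))
        col_br = int(np.argmin((row_counts / row_counts.sum()) @ payoff))
        row_counts[row_br] += 1.0
        col_counts[col_br] += 1.0
    return row_counts / row_counts.sum(), col_counts / col_counts.sum()

# Matching pennies: the unique equilibrium mixes 50/50.
PENNIES = np.array([[1.0, -1.0],
                    [-1.0, 1.0]])
```

In the auction setting, actions would be discretized bid levels and the payoff matrix would come from the simulator fitted to historical data; the empirical action frequencies then converge toward an equilibrium bidding strategy.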