📖 Education

Huazhong University of Science and Technology — Wuhan, China
Sep. 2019 - Jun. 2026, Ph.D. in Control Science and Engineering, School of Artificial Intelligence and Automation, advised by Prof. Yunfeng Luo.
Research areas: Large Language Models, Reinforcement Learning, Game Theory.
📝 Publications
- Preference-CFR Beyond Nash Equilibrium for Better Game Strategies (ICML 2025). Proposes the Preference Counterfactual Regret Minimization (Pref-CFR) algorithm to achieve diverse Nash equilibria, enabling customizable strategies by incorporating preference and vulnerability parameters. Demonstrates distinct play styles in Texas Hold’em without sacrificing strategic strength.
- Accelerating Nash Equilibrium Convergence in Monte Carlo Settings Through Counterfactual Value Based Fictitious Play (NeurIPS 2024). Introduces the Monte Carlo Counterfactual Value-Based Fictitious Play (MCCFVFP) algorithm for large-scale games, achieving 20–50% faster convergence than standard MCCFR in complex settings like Texas Hold’em.
- Real-Time Weighted Fictitious Play: Converging to Equilibrium at the Speed of $O(T^{-1})$ in Games. Presents the Real-Time Weighted Fictitious Play (RTWFP) algorithm with $O(T^{-1})$ convergence in two-player zero-sum games, extending to correlated equilibrium and continuous-time FP. Outperforms existing algorithms in scalability and speed.
- ELO-Rated Sequence Rewards: Advancing Reinforcement Learning Models. Proposes ELO-Rated Sequence Rewards (ERRL), which uses ordinal preferences and ELO ratings to replace numerical rewards, achieving superior performance in long-term RL tasks like Atari benchmarks.
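For the ERRL line of work, a minimal sketch of the Elo-style update that turns ordinal (pairwise) preferences into ratings; the function and constants below are a textbook Elo formulation for illustration, not the paper's implementation:

```python
def elo_update(rating_a, rating_b, a_preferred, k=32.0):
    """Update Elo-style ratings after an ordinal comparison of two trajectories.

    rating_a, rating_b : current ratings of trajectories A and B
    a_preferred        : 1.0 if A is preferred, 0.0 if B is preferred, 0.5 for a tie
    k                  : step size controlling how fast ratings move
    """
    # Expected "win probability" of A under the logistic Elo model
    expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
    new_a = rating_a + k * (a_preferred - expected_a)
    new_b = rating_b + k * ((1.0 - a_preferred) - (1.0 - expected_a))
    return new_a, new_b

# Example: the trajectory rated 1200 is preferred over the one rated 1000
print(elo_update(1200.0, 1000.0, a_preferred=1.0))
```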
🧑‍💼 Work Experience
ByteDance Seed — Doubao Post-Training (Jul. 2026 ~ Now)
Working on post-training of the Doubao large language models at ByteDance Seed, focused on alignment and capability enhancement through supervised fine-tuning, RLHF/RLAIF, and reasoning-oriented reward modeling, with the goal of further improving Doubao’s instruction following, reasoning, and tool use.
💻 Internship Experience

ByteDance Seed — LLM Post-Training Based on Games (Jun. 2025 ~ Oct. 2025)
Individual Contributor.
- Project Goal: integrate game-theoretic problems into LLM post-training to enhance LLM capabilities.
- Project Results: delivered game-oriented LLM evaluation and training frameworks; trained an LLM Texas Hold'em AI that outperformed OpenAI o3, Grok-4, and other models, with improved instruction following.
- Personal Work: proposed a novel algorithm with an "LLM Reflection" mechanism that outperforms traditional RL methods in game scenarios; built the game-oriented LLM evaluation framework (integrated into the team's system) and a training framework based on Verl.
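For illustration, a hypothetical sketch of a turn-based evaluation loop for an LLM game agent; `query_llm`, the `env` interface, and the prompt format are placeholders assumed for this example, not the team's actual framework:

```python
import json

def play_episode(env, query_llm, max_turns=60):
    """Evaluate an LLM agent in a turn-based game.

    env       : game environment with reset(), legal_actions(), step(action), done, payoff
    query_llm : callable that takes a prompt string and returns the model's text reply
    """
    state = env.reset()
    for _ in range(max_turns):
        prompt = (
            f"Current game state:\n{state}\n"
            f"Legal actions: {env.legal_actions()}\n"
            'Reply with JSON: {"action": <one legal action>}.'
        )
        reply = query_llm(prompt)
        try:
            action = json.loads(reply)["action"]
            if action not in env.legal_actions():
                raise ValueError(action)
        except (json.JSONDecodeError, KeyError, ValueError):
            action = env.legal_actions()[0]  # fall back to a default legal action
        state = env.step(action)
        if env.done:
            break
    return env.payoff  # final score used for head-to-head comparisons
```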

vivo — SD Model Fine-Tuning via Reinforcement Learning (Feb. 2025 ~ Apr. 2025)
- Project Goal: fine-tune Stable Diffusion with reinforcement learning to improve generation quality and prompt alignment.
- Project Results: early results show clear improvements in image quality, textual relevance and human preference alignment; ongoing work on reward design and distributed training scale-up.
- Personal Work: designed a composite reward model combining aesthetics, textual relevance, diversity and human feedback; tuned the RL training pipeline for SD.
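A minimal sketch of how a composite reward of this kind could be formed as a weighted sum; the component scorers and weights below are placeholders for illustration, not the reward model actually used in the project:

```python
def composite_reward(image, prompt, scorers, weights):
    """Combine several per-image scores into one scalar reward for RL fine-tuning.

    scorers : dict mapping component name -> callable(image, prompt) -> score in [0, 1]
    weights : dict mapping component name -> weight
    """
    return sum(weights[name] * score_fn(image, prompt) for name, score_fn in scorers.items())

# Hypothetical usage with placeholder scorers standing in for an aesthetic predictor,
# a CLIP-style text-image similarity, a diversity bonus, and a human-preference model.
scorers = {
    "aesthetics":       lambda img, p: 0.8,
    "text_relevance":   lambda img, p: 0.6,
    "diversity":        lambda img, p: 0.5,
    "human_preference": lambda img, p: 0.7,
}
weights = {"aesthetics": 0.3, "text_relevance": 0.4, "diversity": 0.1, "human_preference": 0.2}
print(composite_reward(None, "a cat wearing a hat", scorers, weights))
```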

Fen AI Lab — Texas Hold’em AI (Sep. 2023 ~ Jan. 2024)
Project member, team of four.
- Project Goal: create an AI that matches the performance of Pluribus, a renowned Texas Hold’em AI.
- Project Results: the final AI reached the level of professional players in two-player Texas Hold’em; the multi-player version is still under development.
- Personal Work: contributed to key algorithms (MCCFR and MCCFR with pruning), built foundational components such as strategy storage and result visualization, and handled algorithm parameter tuning and testing.
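For context, a minimal sketch of regret matching, the strategy-update rule at the core of CFR/MCCFR (a standard textbook formulation, not the project's code):

```python
import numpy as np

def regret_matching(cumulative_regret):
    """Turn cumulative counterfactual regrets into a strategy (probability vector).

    Actions with positive regret are played in proportion to that regret;
    if no action has positive regret, fall back to the uniform strategy.
    """
    positive = np.maximum(cumulative_regret, 0.0)
    total = positive.sum()
    if total > 0.0:
        return positive / total
    return np.full_like(cumulative_regret, 1.0 / len(cumulative_regret))

# Example: regrets for (fold, call, raise) at one information set
print(regret_matching(np.array([-2.0, 3.0, 1.0])))  # -> [0.  , 0.75, 0.25]
```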

ByteDance Nuverse — Reinforcement Learning Internship (Jul. 2021 ~ Mar. 2022)
Main implementer, team of two.
- Project Goal: design multi-style AI companion NPCs for the game One Piece: Burning Blood.
- Project Results: added a style-evolution module on top of the existing AI training framework, improving key metrics by 80–120% and producing clearly differentiated play styles; several AIs reached deployable quality.
- Personal Work: served as the main implementer of the project; under my advisor's guidance, implemented the multi-style AI algorithm and explored integrating human preferences into reinforcement learning, culminating in a research paper summarizing the findings and potential applications.

China Resources Group — Land Auction (Feb. 2021 ~ Jun. 2021)
Project leader, team of six.
- Project Goal: design a bidding strategy for China Resources Group in the “first/last” land auction.
- Project Results: the strategy was approved by China Resources Land Group and deployed in dozens of land auctions (each over $100M). The algorithm outperformed the group’s expert approach, boosting bid accuracy 3–4× and winning probability by ~5%, and was adopted as their standard land auction strategy.
- Personal Work: built a simulator of the “first/last” land auction from historical data and applied the Fictitious Play algorithm to develop strategies. Participated in three real auctions involving a total bidding scale of $1B, and refined the model based on real-world outcomes.
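A minimal sketch of fictitious play in a symmetric bidding game of the kind described above; the bid grid and first-price payoff are illustrative stand-ins, not the deployed auction model:

```python
import random
from collections import Counter

def fictitious_play(bid_grid, payoff, n_opponents=3, iterations=500, samples=50):
    """Fictitious play in a symmetric bidding game: each round, best-respond to the
    empirical distribution of past best responses, then add that response to history.

    bid_grid : list of candidate bids
    payoff   : callable(my_bid, opponent_bids) -> float
    """
    history = Counter(bid_grid)  # uniform prior: one observation per candidate bid
    for _ in range(iterations):
        bids, counts = zip(*history.items())

        def expected_payoff(my_bid):
            # Monte Carlo estimate against opponents drawn from the empirical mix
            return sum(
                payoff(my_bid, random.choices(bids, weights=counts, k=n_opponents))
                for _ in range(samples)
            ) / samples

        history[max(bid_grid, key=expected_payoff)] += 1
    total = sum(history.values())
    return {bid: count / total for bid, count in sorted(history.items())}  # average strategy

# Illustrative first-price auction with common value 100: highest bid wins, ties split.
def payoff(my_bid, opponent_bids):
    top = max(opponent_bids)
    if my_bid > top:
        return 100 - my_bid
    if my_bid == top:
        return (100 - my_bid) / (opponent_bids.count(top) + 1)
    return 0.0

print(fictitious_play(bid_grid=list(range(0, 101, 10)), payoff=payoff))
```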