- The Hong Kong University of Science and Technology
- jxhe.github.io
- @junxian_he
Stars
A curated list of awesome Claude Skills, resources, and tools for customizing Claude AI workflows
The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution
MiniMax-M2, a model built for Max coding & agentic workflows.
[ICML2025 Oral] LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models
Pushing Test-Time Scaling Limits of Deep Search with Asymmetric Verification
Building Open-Ended Embodied Agents with Internet-Scale Knowledge
Meta Agents Research Environments is a comprehensive platform designed to evaluate AI agents in dynamic, realistic scenarios. Unlike static benchmarks, this platform introduces evolving environment…
slime is an LLM post-training framework for RL scaling.
A clean, modular SDK for building AI agents with OpenHands V1.
The official repo of "WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents"
The official code repository for the paper "Mirage or Method? How Model–Task Alignment Induces Divergent RL Conclusions".
A benchmark for LLMs on complicated tasks in the terminal
An open-source AI agent that brings the power of Gemini directly into your terminal.
Renderer for the harmony response format to be used with gpt-oss
gpt-oss-120b and gpt-oss-20b are two open-weight language models by OpenAI
Leaked system prompts for ChatGPT, Gemini, Grok, Claude, Perplexity, Cursor, Devin, Replit, and more — AI systems transparency for all! 👐
Connect APIs, remarkably fast. Free for developers.
An extremely fast Python package and project manager, written in Rust.
The official Python SDK for Model Context Protocol servers and clients
Kimi K2 is the large language model series developed by the Moonshot AI team.
Democratizing Reinforcement Learning for LLMs
Resources and paper list for "Thinking with Images for LVLMs". This repository accompanies our survey on how LVLMs can leverage visual information for complex reasoning, planning, and generation.
An agent benchmark with tasks in a simulated software company.
Gorilla: Training and Evaluating LLMs for Function Calls (Tool Calls)