diff --git a/README.md b/README.md
index 4dd203cf..8d450e26 100644
--- a/README.md
+++ b/README.md
@@ -129,7 +129,7 @@ our [documentation](docs/source_en/Usage%20Guide/Train-as-a-Service.md).
 | Hardware Environment | Notes |
 | -------------------- | ---------------------------------------------------------------- |
 | Nvidia GPUs | ✅ Support for BF16/Flash-Attn may be incomplete in earlier GPUs |
-| Ascend NPU | ✅ Some operators may not be supported |
+| Ascend NPU | ✅ FP8 is not supported on A2 and A3 due to hardware limitations |
 | PPU | ✅ |
 | CPU | Supports partial components like dataset, dataloader |
 
diff --git a/README_ZH.md b/README_ZH.md
index 5d588b39..1792a558 100644
--- a/README_ZH.md
+++ b/README_ZH.md
@@ -123,7 +123,7 @@ Twinkle✨支持相同的算法接口运行在单GPU、torchrun多机、Ray、Cl
 | 硬件环境 | 备注 |
 | -------- | --------------------------------------------------------------- |
 | Nvidia GPU | ✅ 早期 GPU 对 BF16/Flash-Attn 的支持可能不完整 |
-| 昇腾 NPU | ✅ 部分算子可能不支持 |
+| 昇腾 NPU | ✅ 由于硬件限制A2、A3暂不支持FP8 |
 | PPU | ✅ |
 | CPU | 支持部分组件如 dataset、dataloader |
 
diff --git a/docs/source_en/Usage Guide/NPU-Support.md b/docs/source_en/Usage Guide/NPU-Support.md
index 776d4798..55b9605a 100644
--- a/docs/source_en/Usage Guide/NPU-Support.md
+++ b/docs/source_en/Usage Guide/NPU-Support.md
@@ -159,6 +159,62 @@ First run a minimal import check to make sure the current environment can resolv
 python -c "import mindspeed.megatron_adaptor; from twinkle.model.megatron._mindspeed_runtime import ensure_mindspeed_adaptor_patched; ensure_mindspeed_adaptor_patched(); print('✓ Megatron backend imports are ready')"
 ```
 
+### 6. Qwen3.5/3.6 FLA and Triton-Ascend Version Compatibility
+
+**FLA Enablement Conditions**
+
+To use FLA (Flash Linear Attention) with Qwen3.5/3.6 on the transformers backend, the following conditions must be met:
+
+- Install `triton-ascend`
+- `mindspeed` version `26.0.0_core_r0.12.1`
+
+**Triton-Ascend Version and CANN Compatibility**
+
+| triton-ascend | CANN | Additional Dependencies |
+| --- | --- | --- |
+| 3.2.0 | 8.5.x | Do not install `triton` |
+| 3.2.1 | 9.0.0 | `triton` must be installed |
+
+**MindSpeed Version and Code Adaptation**
+
+The currently validated MindSpeed version is `26.0.0_core_r0.12.1`. MindSpeed repository: [https://gitcode.com/Ascend/MindSpeed](https://gitcode.com/Ascend/MindSpeed)
+
+If using a higher MindSpeed version, note that the following import paths in `src/twinkle/kernel/chunk_gated_delta_rule.py` may need to be adjusted to match the actual code locations in MindSpeed:
+
+```python
+from mindspeed.lite.ops.triton.chunk_delta_h import chunk_gated_delta_rule_bwd_dhu, chunk_gated_delta_rule_fwd_h
+from mindspeed.lite.ops.triton.chunk_o import chunk_bwd_dqkwg, chunk_bwd_dv_local, chunk_fwd_o
+from mindspeed.lite.ops.triton.chunk_scaled_dot_kkt import chunk_scaled_dot_kkt_fwd
+from mindspeed.lite.ops.triton.cumsum import chunk_local_cumsum
+from mindspeed.lite.ops.triton.solve_tril import solve_tril
+from mindspeed.lite.ops.triton.utils import autocast_custom_bwd, autocast_custom_fwd, input_guard
+from mindspeed.lite.ops.triton.wy_fast import prepare_wy_repr_bwd, recompute_w_u_fwd
+```
+
+### 7. NPU Patch Environment Variable Configuration
+
+Twinkle enables model-layer patches by default in NPU environments. The following environment variables provide fine-grained control:
+
+| Environment Variable | Description | Default |
+| --- | --- | --- |
+| `TWINKLE_NPU_PATCH` | Master switch for all NPU optimizations | `1` (enabled) |
+| `TWINKLE_NPU_FUSED_OPS` | Enable fused operators (RMSNorm, RoPE, SwiGLU, SDPA) | `1` (enabled) |
+| `TWINKLE_NPU_MOE_PATCH` | Enable MoE Grouped MatMul | `1` (enabled) |
+| `TWINKLE_NPU_FLA` | Enable Qwen3.5 Flash Linear Attention; set to `0` to force torch fallback | `1` (enabled) |
+
+**Usage examples**:
+
+```bash
+# Disable all NPU optimizations and fall back to native Transformers
+export TWINKLE_NPU_PATCH=0
+
+# Disable FLA only while keeping other fused operators
+export TWINKLE_NPU_FLA=0
+
+# Disable MoE patch only
+export TWINKLE_NPU_MOE_PATCH=0
+```
+
 ## Quick Start
 
 **Important Notice**: The following examples are from the `cookbook/` directory and have been verified in actual NPU environments. It is recommended to run scripts directly from the cookbook rather than copying and pasting code snippets.
diff --git "a/docs/source_zh/\344\275\277\347\224\250\346\214\207\345\274\225/NPU\347\232\204\346\224\257\346\214\201.md" "b/docs/source_zh/\344\275\277\347\224\250\346\214\207\345\274\225/NPU\347\232\204\346\224\257\346\214\201.md"
index 39f6fe18..9d4a9905 100644
--- "a/docs/source_zh/\344\275\277\347\224\250\346\214\207\345\274\225/NPU\347\232\204\346\224\257\346\214\201.md"
+++ "b/docs/source_zh/\344\275\277\347\224\250\346\214\207\345\274\225/NPU\347\232\204\346\224\257\346\214\201.md"
@@ -158,6 +158,61 @@ export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
 ```bash
 python -c "import mindspeed.megatron_adaptor; from twinkle.model.megatron._mindspeed_runtime import ensure_mindspeed_adaptor_patched; ensure_mindspeed_adaptor_patched(); print('✓ Megatron backend imports are ready')"
 ```
+### 6. Qwen3.5/3.6 FLA 与 Triton-Ascend 版本配套
+
+**FLA 开启条件**
+
+Qwen3.5/3.6 在 transformers 后端使用 FLA(Flash Linear Attention)时,需要满足以下条件:
+
+- 安装 `triton-ascend`
+- `mindspeed` 版本为 `26.0.0_core_r0.12.1`
+
+**Triton-Ascend 版本与 CANN 配套**
+
+| triton-ascend | CANN | 额外依赖 |
+| --- | --- | --- |
+| 3.2.0 | 8.5.x | 不需要安装 `triton` |
+| 3.2.1 | 9.0.0 | 需要安装 `triton` |
+
+**MindSpeed 版本与代码适配**
+
+当前验证的 MindSpeed 版本为 `26.0.0_core_r0.12.1`。MindSpeed 代码仓地址:[https://gitcode.com/Ascend/MindSpeed](https://gitcode.com/Ascend/MindSpeed)
+
+如使用更高版本 MindSpeed,需注意 `src/twinkle/kernel/chunk_gated_delta_rule.py` 中的以下导入路径可能需要对应 MindSpeed 实际代码位置进行修改:
+
+```python
+from mindspeed.lite.ops.triton.chunk_delta_h import chunk_gated_delta_rule_bwd_dhu, chunk_gated_delta_rule_fwd_h
+from mindspeed.lite.ops.triton.chunk_o import chunk_bwd_dqkwg, chunk_bwd_dv_local, chunk_fwd_o
+from mindspeed.lite.ops.triton.chunk_scaled_dot_kkt import chunk_scaled_dot_kkt_fwd
+from mindspeed.lite.ops.triton.cumsum import chunk_local_cumsum
+from mindspeed.lite.ops.triton.solve_tril import solve_tril
+from mindspeed.lite.ops.triton.utils import autocast_custom_bwd, autocast_custom_fwd, input_guard
+from mindspeed.lite.ops.triton.wy_fast import prepare_wy_repr_bwd, recompute_w_u_fwd
+```
+
+### 7. NPU Patch 环境变量配置
+
+Twinkle 在 NPU 环境下默认启用模型层补丁,可通过以下环境变量进行细粒度控制:
+
+| 环境变量 | 说明 | 默认值 |
+| --- | --- | --- |
+| `TWINKLE_NPU_PATCH` | 所有 NPU 优化的总开关 | `1`(启用) |
+| `TWINKLE_NPU_FUSED_OPS` | 启用融合算子(RMSNorm、RoPE、SwiGLU、SDPA) | `1`(启用) |
+| `TWINKLE_NPU_MOE_PATCH` | 启用 MoE Grouped MatMul | `1`(启用) |
+| `TWINKLE_NPU_FLA` | 启用 Qwen3.5 Flash Linear Attention;设为 `0` 强制回退到 torch 实现 | `1`(启用) |
+
+**使用示例**:
+
+```bash
+# 关闭所有 NPU 优化,回退到 Transformers 原生实现
+export TWINKLE_NPU_PATCH=0
+
+# 仅关闭 FLA,保留其他融合算子
+export TWINKLE_NPU_FLA=0
+
+# 仅关闭 MoE 补丁
+export TWINKLE_NPU_MOE_PATCH=0
+```
 
 ## 快速开始
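
Review note on section 7 of the added NPU docs: the table describes `TWINKLE_NPU_PATCH` as a master switch sitting above three per-feature switches, each defaulting to `1`. The sketch below illustrates one way that gating could be evaluated. It is not Twinkle's implementation: the helper names `_flag` and `resolve_npu_switches` are hypothetical, and the parsing rule (only the literal `0` disables a switch) is an assumption, since the diff documents only the values `0` and `1`.

```python
import os

# Variable names and defaults are taken from the table in the diff.
# The parsing rule below is an assumption, not confirmed by the source.
FEATURE_SWITCHES = (
    "TWINKLE_NPU_FUSED_OPS",
    "TWINKLE_NPU_MOE_PATCH",
    "TWINKLE_NPU_FLA",
)


def _flag(name, env, default="1"):
    """Hypothetical helper: a switch is off only when set to "0"."""
    return env.get(name, default).strip() != "0"


def resolve_npu_switches(env=None):
    """Return the effective on/off state of each per-feature switch.

    A feature counts as enabled only if both the master switch
    (TWINKLE_NPU_PATCH) and its own switch are enabled.
    """
    env = os.environ if env is None else env
    master = _flag("TWINKLE_NPU_PATCH", env)
    return {name: master and _flag(name, env) for name in FEATURE_SWITCHES}


if __name__ == "__main__":
    # Mirrors the "disable FLA only" usage example from the docs.
    print(resolve_npu_switches({"TWINKLE_NPU_FLA": "0"}))
```

Passing a plain dict instead of reading `os.environ` directly keeps the logic easy to check without exporting variables in a shell.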