2 changes: 1 addition & 1 deletion README.md
@@ -129,7 +129,7 @@ our [documentation](docs/source_en/Usage%20Guide/Train-as-a-Service.md).
| Hardware Environment | Notes |
| -------------------- | ---------------------------------------------------------------- |
| Nvidia GPUs | ✅ Support for BF16/Flash-Attn may be incomplete in earlier GPUs |
| Ascend NPU | ✅ Some operators may not be supported |
| Ascend NPU | ✅ FP8 is not supported on A2 and A3 due to hardware limitations |
| PPU | ✅ |
| CPU | Supports partial components like dataset, dataloader |

2 changes: 1 addition & 1 deletion README_ZH.md
@@ -123,7 +123,7 @@ Twinkle✨支持相同的算法接口运行在单GPU、torchrun多机、Ray、Cl
| 硬件环境 | 备注 |
| -------- | --------------------------------------------------------------- |
| Nvidia GPU | ✅ 早期 GPU 对 BF16/Flash-Attn 的支持可能不完整 |
| 昇腾 NPU | ✅ 部分算子可能不支持 |
| 昇腾 NPU | ✅ 由于硬件限制A2、A3暂不支持FP8 |
| PPU | ✅ |
| CPU | 支持部分组件如 dataset、dataloader |

56 changes: 56 additions & 0 deletions docs/source_en/Usage Guide/NPU-Support.md
@@ -159,6 +159,62 @@ First run a minimal import check to make sure the current environment can resolv
python -c "import mindspeed.megatron_adaptor; from twinkle.model.megatron._mindspeed_runtime import ensure_mindspeed_adaptor_patched; ensure_mindspeed_adaptor_patched(); print('✓ Megatron backend imports are ready')"
```

### 6. Qwen3.5/3.6 FLA and Triton-Ascend Version Compatibility

**FLA Enablement Conditions**

To use FLA (Flash Linear Attention) with Qwen3.5/3.6 on the transformers backend, the following conditions must be met:

- `triton-ascend` is installed
- `mindspeed` is at version `26.0.0_core_r0.12.1`

**Triton-Ascend Version and CANN Compatibility**

| triton-ascend | CANN | Additional Dependencies |
| --- | --- | --- |
| 3.2.0 | 8.5.x | Do not install `triton` |
| 3.2.1 | 9.0.0 | `triton` must be installed |
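
The pairing in the table above can be checked in a few lines before enabling FLA. The following is a minimal sketch and not part of Twinkle: `check_triton_pairing` and `installed_version` are hypothetical helper names, the CANN version is passed in by the caller because this sketch does not try to detect it, and it assumes the pip distribution names are `triton-ascend` and `triton` and that installed versions match the table entries exactly.

```python
from importlib import metadata
from typing import List, Optional

# Pairings copied from the table above:
# triton-ascend version -> (CANN version prefix, whether `triton` must be installed).
PAIRINGS = {
    "3.2.0": ("8.5.", False),
    "3.2.1": ("9.0.0", True),
}


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def check_triton_pairing(
    triton_ascend: Optional[str], cann: str, triton: Optional[str]
) -> List[str]:
    """Return a list of mismatches against the table; an empty list means OK."""
    if triton_ascend is None:
        return ["triton-ascend is not installed"]
    if triton_ascend not in PAIRINGS:
        return [f"triton-ascend {triton_ascend} is not listed in the table"]

    cann_prefix, needs_triton = PAIRINGS[triton_ascend]
    problems = []
    if not cann.startswith(cann_prefix):
        problems.append(
            f"triton-ascend {triton_ascend} expects CANN {cann_prefix}*, got {cann}"
        )
    if needs_triton and triton is None:
        problems.append(f"triton-ascend {triton_ascend} requires `triton`")
    if not needs_triton and triton is not None:
        problems.append(
            f"triton-ascend {triton_ascend} should not be combined with `triton`"
        )
    return problems


if __name__ == "__main__":
    # The CANN version here is an example value supplied by hand.
    issues = check_triton_pairing(
        installed_version("triton-ascend"),
        cann="8.5.0",
        triton=installed_version("triton"),
    )
    print(issues or "pairing matches the table")
```

Keeping the check as a pure function over version strings makes it easy to run on a machine without an NPU.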

**MindSpeed Version and Code Adaptation**

The currently validated MindSpeed version is `26.0.0_core_r0.12.1`. MindSpeed repository: [https://gitcode.com/Ascend/MindSpeed](https://gitcode.com/Ascend/MindSpeed)

If you use a newer MindSpeed version, the following import paths in `src/twinkle/kernel/chunk_gated_delta_rule.py` may need to be adjusted to match where these modules are located in that version:

```python
from mindspeed.lite.ops.triton.chunk_delta_h import chunk_gated_delta_rule_bwd_dhu, chunk_gated_delta_rule_fwd_h
from mindspeed.lite.ops.triton.chunk_o import chunk_bwd_dqkwg, chunk_bwd_dv_local, chunk_fwd_o
from mindspeed.lite.ops.triton.chunk_scaled_dot_kkt import chunk_scaled_dot_kkt_fwd
from mindspeed.lite.ops.triton.cumsum import chunk_local_cumsum
from mindspeed.lite.ops.triton.solve_tril import solve_tril
from mindspeed.lite.ops.triton.utils import autocast_custom_bwd, autocast_custom_fwd, input_guard
from mindspeed.lite.ops.triton.wy_fast import prepare_wy_repr_bwd, recompute_w_u_fwd
```
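
One way to make such an adjustment tolerant of several MindSpeed layouts is to try candidate module paths in order. The sketch below is illustrative and is not how Twinkle resolves these imports: `import_first` is a hypothetical helper, and the second candidate path is a made-up placeholder, since the source does not say where newer MindSpeed versions place these modules.

```python
import importlib
from types import ModuleType
from typing import Sequence


def import_first(candidates: Sequence[str]) -> ModuleType:
    """Import and return the first module in `candidates` that resolves."""
    errors = []
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError as exc:  # also covers ModuleNotFoundError
            errors.append(f"{name}: {exc}")
    raise ImportError(
        "none of the candidate modules could be imported:\n" + "\n".join(errors)
    )


# The first entry is the path validated with 26.0.0_core_r0.12.1 (see above).
# The second entry is a hypothetical placeholder, NOT a real MindSpeed path.
SOLVE_TRIL_CANDIDATES = [
    "mindspeed.lite.ops.triton.solve_tril",
    "mindspeed.some_future_location.solve_tril",
]

if __name__ == "__main__":
    try:
        solve_tril = import_first(SOLVE_TRIL_CANDIDATES).solve_tril
        print("resolved solve_tril from", solve_tril.__module__)
    except ImportError as exc:
        print(exc)
```

Collecting the per-candidate errors keeps the final message useful when no layout matches.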

### 7. NPU Patch Environment Variable Configuration

Twinkle enables model-layer patches by default in NPU environments. The following environment variables provide fine-grained control:

| Environment Variable | Description | Default |
| --- | --- | --- |
| `TWINKLE_NPU_PATCH` | Master switch for all NPU optimizations | `1` (enabled) |
| `TWINKLE_NPU_FUSED_OPS` | Enable fused operators (RMSNorm, RoPE, SwiGLU, SDPA) | `1` (enabled) |
| `TWINKLE_NPU_MOE_PATCH` | Enable MoE Grouped MatMul | `1` (enabled) |
| `TWINKLE_NPU_FLA` | Enable Qwen3.5 Flash Linear Attention; set to `0` to force torch fallback | `1` (enabled) |

**Usage examples**:

```bash
# Disable all NPU optimizations and fall back to native Transformers
export TWINKLE_NPU_PATCH=0

# Disable FLA only while keeping other fused operators
export TWINKLE_NPU_FLA=0

# Disable MoE patch only
export TWINKLE_NPU_MOE_PATCH=0
```
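
The way the four switches combine can be sketched in Python. This is an illustration under stated assumptions, not Twinkle's actual parsing code: `resolve_npu_flags` is a hypothetical name, any value other than `0` is treated as enabled, and the master switch `TWINKLE_NPU_PATCH=0` is assumed to override the three specific switches.

```python
import os
from typing import Dict, Mapping

# Feature-specific switches from the table above.
FEATURE_FLAGS = (
    "TWINKLE_NPU_FUSED_OPS",
    "TWINKLE_NPU_MOE_PATCH",
    "TWINKLE_NPU_FLA",
)


def _enabled(env: Mapping[str, str], name: str) -> bool:
    """Assumption: unset means enabled, and only the value "0" disables."""
    return env.get(name, "1").strip() != "0"


def resolve_npu_flags(env: Mapping[str, str] = os.environ) -> Dict[str, bool]:
    """Return the effective on/off state of each feature-specific switch."""
    # Assumption: the master switch gates every feature-specific switch.
    master = _enabled(env, "TWINKLE_NPU_PATCH")
    return {name: master and _enabled(env, name) for name in FEATURE_FLAGS}


if __name__ == "__main__":
    for name, state in resolve_npu_flags().items():
        print(f"{name}: {'enabled' if state else 'disabled'}")
```

Passing the environment as a mapping lets the logic be exercised with a plain dictionary instead of mutating `os.environ`.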

## Quick Start

**Important Notice**: The following examples are from the `cookbook/` directory and have been verified in actual NPU environments. Run the scripts directly from the cookbook rather than copying and pasting code snippets.
55 changes: 55 additions & 0 deletions docs/source_zh/使用指引/NPU的支持.md
@@ -158,6 +158,61 @@ export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
```bash
python -c "import mindspeed.megatron_adaptor; from twinkle.model.megatron._mindspeed_runtime import ensure_mindspeed_adaptor_patched; ensure_mindspeed_adaptor_patched(); print('✓ Megatron backend imports are ready')"
```
### 6. Qwen3.5/3.6 FLA 与 Triton-Ascend 版本配套

**FLA 开启条件**

Qwen3.5/3.6 在 transformers 后端使用 FLA(Flash Linear Attention)时,需要满足以下条件:

- 安装 `triton-ascend`
- `mindspeed` 版本为 `26.0.0_core_r0.12.1`

**Triton-Ascend 版本与 CANN 配套**

| triton-ascend | CANN | 额外依赖 |
| --- | --- | --- |
| 3.2.0 | 8.5.x | 不需要安装 `triton` |
| 3.2.1 | 9.0.0 | 需要安装 `triton` |

**MindSpeed 版本与代码适配**

当前验证的 MindSpeed 版本为 `26.0.0_core_r0.12.1`。MindSpeed 代码仓地址:[https://gitcode.com/Ascend/MindSpeed](https://gitcode.com/Ascend/MindSpeed)

如使用更高版本 MindSpeed,需注意 `src/twinkle/kernel/chunk_gated_delta_rule.py` 中的以下导入路径可能需要对应 MindSpeed 实际代码位置进行修改:

```python
from mindspeed.lite.ops.triton.chunk_delta_h import chunk_gated_delta_rule_bwd_dhu, chunk_gated_delta_rule_fwd_h
from mindspeed.lite.ops.triton.chunk_o import chunk_bwd_dqkwg, chunk_bwd_dv_local, chunk_fwd_o
from mindspeed.lite.ops.triton.chunk_scaled_dot_kkt import chunk_scaled_dot_kkt_fwd
from mindspeed.lite.ops.triton.cumsum import chunk_local_cumsum
from mindspeed.lite.ops.triton.solve_tril import solve_tril
from mindspeed.lite.ops.triton.utils import autocast_custom_bwd, autocast_custom_fwd, input_guard
from mindspeed.lite.ops.triton.wy_fast import prepare_wy_repr_bwd, recompute_w_u_fwd
```

### 7. NPU Patch 环境变量配置

Twinkle 在 NPU 环境下默认启用模型层补丁,可通过以下环境变量进行细粒度控制:

| 环境变量 | 说明 | 默认值 |
| --- | --- | --- |
| `TWINKLE_NPU_PATCH` | 所有 NPU 优化的总开关 | `1`(启用) |
| `TWINKLE_NPU_FUSED_OPS` | 启用融合算子(RMSNorm、RoPE、SwiGLU、SDPA) | `1`(启用) |
| `TWINKLE_NPU_MOE_PATCH` | 启用 MoE Grouped MatMul | `1`(启用) |
| `TWINKLE_NPU_FLA` | 启用 Qwen3.5 Flash Linear Attention;设为 `0` 强制回退到 torch 实现 | `1`(启用) |

**使用示例**:

```bash
# 关闭所有 NPU 优化,回退到 Transformers 原生实现
export TWINKLE_NPU_PATCH=0

# 仅关闭 FLA,保留其他融合算子
export TWINKLE_NPU_FLA=0

# 仅关闭 MoE 补丁
export TWINKLE_NPU_MOE_PATCH=0
```

## 快速开始
