Commit d1958dc

Farah-kassabri authored and ogabbay committed
accel/habanalabs: fix EQ heartbeat mechanism

Stop rescheduling another heartbeat check when EQ heartbeat check fails
as it generates confusing logs in dmesg that the heartbeat fails.

Signed-off-by: Farah Kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

1 parent 4242299 commit d1958dc

1 file changed

Lines changed: 7 additions & 7 deletions

drivers/accel/habanalabs/common/device.c

@@ -1044,20 +1044,21 @@ static bool is_pci_link_healthy(struct hl_device *hdev)
 	return (vendor_id == PCI_VENDOR_ID_HABANALABS);
 }
 
-static void hl_device_eq_heartbeat(struct hl_device *hdev)
+static int hl_device_eq_heartbeat_check(struct hl_device *hdev)
 {
-	u64 event_mask = HL_NOTIFIER_EVENT_DEVICE_RESET | HL_NOTIFIER_EVENT_DEVICE_UNAVAILABLE;
 	struct asic_fixed_properties *prop = &hdev->asic_prop;
 
 	if (!prop->cpucp_info.eq_health_check_supported)
-		return;
+		return 0;
 
 	if (hdev->eq_heartbeat_received) {
 		hdev->eq_heartbeat_received = false;
 	} else {
 		dev_err(hdev->dev, "EQ heartbeat event was not received!\n");
-		hl_device_cond_reset(hdev, HL_DRV_RESET_HARD, event_mask);
+		return -EIO;
 	}
+
+	return 0;
 }
 
 static void hl_device_heartbeat(struct work_struct *work)
@@ -1074,10 +1075,9 @@ static void hl_device_heartbeat(struct work_struct *work)
 	/*
 	 * For EQ health check need to check if driver received the heartbeat eq event
 	 * in order to validate the eq is working.
+	 * Only if both the EQ is healthy and we managed to send the next heartbeat reschedule.
 	 */
-	hl_device_eq_heartbeat(hdev);
-
-	if (!hdev->asic_funcs->send_heartbeat(hdev))
+	if ((!hl_device_eq_heartbeat_check(hdev)) && (!hdev->asic_funcs->send_heartbeat(hdev)))
 		goto reschedule;
 
 	if (hl_device_operational(hdev, NULL))
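
The control flow introduced by the second hunk can be sketched in plain user-space C. This is only an illustrative model, not driver code: `struct mock_device`, `mock_eq_heartbeat_check()`, `mock_send_heartbeat()`, `mock_should_reschedule()` and the `heartbeat_attempts` counter are hypothetical names invented for this sketch. It assumes, as the diff suggests, that both the check and `send_heartbeat()` return 0 on success and a negative errno on failure.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the few hl_device fields the patch touches. */
struct mock_device {
	bool eq_health_check_supported;
	bool eq_heartbeat_received;
	bool send_heartbeat_ok;     /* assumption: models firmware answering */
	int heartbeat_attempts;     /* counts calls to mock_send_heartbeat() */
};

/*
 * Same shape as hl_device_eq_heartbeat_check() after the patch:
 * it only reports the result and leaves the reaction to the caller.
 */
static int mock_eq_heartbeat_check(struct mock_device *dev)
{
	if (!dev->eq_health_check_supported)
		return 0;

	if (!dev->eq_heartbeat_received)
		return -EIO;

	/* Clear the flag so the next period requires a fresh EQ event. */
	dev->eq_heartbeat_received = false;
	return 0;
}

/* Hypothetical stand-in for asic_funcs->send_heartbeat(). */
static int mock_send_heartbeat(struct mock_device *dev)
{
	dev->heartbeat_attempts++;
	return dev->send_heartbeat_ok ? 0 : -EIO;
}

/*
 * True when the worker would take the "goto reschedule" path.
 * Because && short-circuits, no heartbeat is sent once the EQ check fails.
 */
static bool mock_should_reschedule(struct mock_device *dev)
{
	return !mock_eq_heartbeat_check(dev) && !mock_send_heartbeat(dev);
}
```

In this model, a device whose EQ event did not arrive yields `false` from `mock_should_reschedule()` with `heartbeat_attempts` still at 0, which corresponds to the commit's goal of not scheduling another heartbeat check after an EQ heartbeat failure. What the real worker does when it skips `goto reschedule` is only partly visible in the diff and is not modeled here.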
