
chore: Enrol can clear bios jobs #1979

Open · stevekeay wants to merge 4 commits into main from enrol-clear-bios-jobs

chore: Enrol can clear bios jobs#1979
stevekeay wants to merge 4 commits into
mainfrom
enrol-clear-bios-jobs

Conversation

stevekeay (Contributor) commented Apr 23, 2026

(1) It's quite easy to create "conflicting" jobs: a stale job left behind prevents any progress, because a new job can't be created until the old one is deleted.

This change simply clears all jobs on startup. I know there is a hook that ships with Ironic, which would normally be preferable to adding code here, but doing it here makes it run at the right time, when we need it to.
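For illustration, clearing the iDRAC job queue at startup can be done with Dell's Redfish OEM action. This is a sketch, not the PR's actual code: the DellJobService path and the JID_CLEARALL magic job ID are Dell conventions that vary between iDRAC firmware versions, so verify them against your BMC before relying on this.

```python
# Sketch (assumed Dell OEM Redfish action, not this PR's code): ask the
# iDRAC to drop its entire job queue so stale jobs can't block new ones.
import json
import urllib.request

# Dell-specific path; older firmware exposes DellJobService elsewhere.
DELETE_QUEUE_PATH = (
    "/redfish/v1/Managers/iDRAC.Embedded.1/Oem/Dell"
    "/DellJobService/Actions/DellJobService.DeleteJobQueue"
)

def build_clear_jobs_request(bmc_url: str) -> urllib.request.Request:
    """Build the POST asking the iDRAC to delete its whole job queue."""
    payload = json.dumps({"JobID": "JID_CLEARALL"}).encode()
    return urllib.request.Request(
        bmc_url.rstrip("/") + DELETE_QUEUE_PATH,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def clear_job_queue(bmc_url: str, opener=urllib.request.urlopen) -> int:
    """Send the request; returns the HTTP status code."""
    with opener(build_clear_jobs_request(bmc_url)) as resp:
        return resp.status
```

In practice you would also pass credentials and TLS settings; the request builder is separated out so the queue-clearing logic can be tested without a live BMC.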

We also add an option to call Ironic's "known_good_state", which reboots the iDRAC just to make sure. Unfortunately this is broken in our Ironic deployment because it relies on the "ping" binary (like we are still in 2006), which is not present in our container. It still resets the iDRAC but then errors out, forcing the user to unset maintenance mode, wait a while, and try again.

(2) If we are to update BIOS settings, we go ahead and take a second reboot, to ensure the BIOS settings change gets committed before we take on anything else. This takes a very long time, but trying to stack the BIOS update alongside other updates seems to result in failures. This change makes the process slower, but more likely to work first time.

(3) Disable some more PXE devices: these can be left enabled by prior processes and confuse the boot output. We're not using PXE, so disable it, period.
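The PXE-disabling step could be expressed as a Redfish BIOS-settings PATCH. A minimal sketch, assuming Dell's PxeDev1EnDis..PxeDev4EnDis attribute names (check your model's BIOS attribute registry); note that BIOS settings changes only take effect after the settings job runs on the next reboot, which is exactly why this PR sequences reboots carefully.

```python
# Sketch: disable PXE boot devices via a Redfish BIOS Settings PATCH.
# Attribute names PxeDevNEnDis are Dell-specific assumptions here.
import json

BIOS_SETTINGS_PATH = "/redfish/v1/Systems/System.Embedded.1/Bios/Settings"

def pxe_disable_payload(devices: int = 4) -> str:
    """JSON body that disables PXE boot devices 1..devices."""
    attrs = {f"PxeDev{n}EnDis": "Disabled" for n in range(1, devices + 1)}
    return json.dumps({"Attributes": attrs})
```

The body would be PATCHed to `BIOS_SETTINGS_PATH`, and the change committed by the BIOS settings job on the following reboot.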

stevekeay force-pushed the enrol-clear-bios-jobs branch from f957cea to 6416295 on April 23, 2026 at 19:31
stevekeay changed the title from feat: Enrol can clear bios jobs to chore: Enrol can clear bios jobs on Apr 23, 2026
We're seeing obscure errors from Ironic like:

failed step {'interface': 'raid', 'step': 'delete_configuration',
'abortable': False, 'priority': 0}: Unable to connect to
/redfish/v1/TaskService/Tasks/JID_768614980495. Error: Timeout waiting
for task monitor /redfish/v1/TaskService/Tasks/JID_768614980495 (timeout
= 500)

To clear this up, we are completing each operation with a separate
reboot.
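The serialisation described above (one operation per reboot, never stacked) can be sketched as a small driver loop. This is an illustration of the sequencing only; `apply` and `reboot` are hypothetical stand-ins for the node's real clean-step and power operations, injected so the logic is testable.

```python
# Sketch of the fix: give each pending change set its own apply + reboot
# cycle instead of stacking BIOS/RAID changes into one reboot.
from typing import Callable, Iterable, List, Tuple

def commit_serially(change_sets: Iterable[str],
                    apply: Callable[[str], None],
                    reboot: Callable[[], None]) -> List[Tuple]:
    """Apply each change set, then reboot before touching the next one."""
    log: List[Tuple] = []
    for change in change_sets:
        apply(change)
        log.append(("apply", change))
        reboot()  # extra reboot: slower, but each change lands alone
        log.append(("reboot",))
    return log
```

The trade-off the PR describes is visible here: N change sets cost N reboots, but no two pending jobs ever compete for the same task monitor.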
stevekeay force-pushed the enrol-clear-bios-jobs branch from 6416295 to 3e1c22f on April 23, 2026 at 19:59
stevekeay requested a review from a team on April 27, 2026 at 19:22
cardoe (Contributor) commented May 5, 2026

@stevekeay Are you wanting to call known_good_state in the normal path? It looks like it just resets the iDRAC and then clears the job queue per https://github.com/openstack/ironic/blob/1114d43ca8cde969e87dd99f5c70cc9c5a57fe93/ironic/drivers/modules/drac/management.py#L161

stevekeay (Contributor, Author) commented:

> @stevekeay Are you wanting to call known_good_state in the normal path? It looks like it just resets the iDRAC and then clears the job queue per https://github.com/openstack/ironic/blob/1114d43ca8cde969e87dd99f5c70cc9c5a57fe93/ironic/drivers/modules/drac/management.py#L161

The plan was to do that ONLY on explicit request by a command line flag.

known_good_state doesn't do any harm but it adds a lot of time to a process which is already very slow. If we find ourselves needing to do it a lot then we can make it the default.

cardoe (Contributor) commented May 6, 2026

> > @stevekeay Are you wanting to call known_good_state in the normal path? It looks like it just resets the iDRAC and then clears the job queue per https://github.com/openstack/ironic/blob/1114d43ca8cde969e87dd99f5c70cc9c5a57fe93/ironic/drivers/modules/drac/management.py#L161
>
> The plan was to do that ONLY on explicit request by a command line flag.
>
> known_good_state doesn't do any harm but it adds a lot of time to a process which is already very slow. If we find ourselves needing to do it a lot then we can make it the default.

Well before making it the default we'd replace the ping call with hitting the redfish API endpoint in a backoff timeout.
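The suggestion above (replace the ping call with hitting the Redfish API in a backoff loop) could look roughly like this. A sketch only: the probe callable is injected, so in production it would be something like a GET on the Redfish service root that returns True on HTTP 200, while the loop itself needs no network access.

```python
# Sketch of cardoe's suggestion: poll a Redfish endpoint with exponential
# backoff until the iDRAC answers, instead of shelling out to ping.
import time
from typing import Callable

def wait_for_bmc(probe: Callable[[], bool],
                 attempts: int = 8,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True once probe() succeeds; back off 1s, 2s, 4s, ... between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        sleep(base_delay * (2 ** attempt))
    return False
```

This sidesteps the missing `ping` binary entirely and checks the thing that actually matters: whether the BMC's Redfish service is answering again after the iDRAC reset.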

cardoe (Contributor) left a review comment


It's not clear to me why the agent inspection runs here a second time vs the other location?
