chore: Enrol can clear bios jobs #1979
We're seeing obscure errors from Ironic like:
failed step {'interface': 'raid', 'step': 'delete_configuration',
'abortable': False, 'priority': 0}: Unable to connect to
/redfish/v1/TaskService/Tasks/JID_768614980495. Error: Timeout waiting
for task monitor /redfish/v1/TaskService/Tasks/JID_768614980495 (timeout
= 500)
To clear this up, we are completing each operation with a separate
reboot.
@stevekeay Are you wanting to call
The plan was to do that ONLY on explicit request via a command-line flag. known_good_state doesn't do any harm, but it adds a lot of time to a process that is already very slow. If we find ourselves needing it a lot, we can make it the default.
Well, before making it the default we'd replace the ping call with polling the Redfish API endpoint, with backoff and a timeout.
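For illustration, a minimal sketch of what such a readiness check could look like, using only the Python standard library. Everything here is an assumption rather than code from this PR: the names `wait_until_ready` and `redfish_probe`, the default delays and timeout, and the choice to skip TLS verification because BMCs commonly present self-signed certificates. The probe targets the Redfish service root `/redfish/v1/`.

```python
import ssl
import time
import urllib.request
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    *,
    timeout: float = 600.0,
    initial_delay: float = 2.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call probe() with exponential backoff until it succeeds or timeout expires."""
    deadline = clock() + timeout
    delay = initial_delay
    while True:
        try:
            if probe():
                return True
        except OSError:
            # Connection refused, timeouts and HTTP errors all mean "not ready yet".
            pass
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay)  # double the wait, capped at max_delay


def redfish_probe(host: str) -> Callable[[], bool]:
    """Build a probe that GETs the Redfish service root on the given BMC host."""
    # Assumption: the BMC uses a self-signed certificate, so verification is off.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    url = f"https://{host}/redfish/v1/"

    def probe() -> bool:
        with urllib.request.urlopen(url, timeout=10, context=context) as response:
            return response.status == 200

    return probe
```

A caller would use it as `wait_until_ready(redfish_probe("bmc.example.net"))`, where the hostname is a placeholder. Unlike a ping, this confirms that the Redfish service itself is answering, and it needs no extra binary in the container.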
cardoe
left a comment
It's not clear to me why the agent inspection is here a second time rather than in the other location?
(1) It's quite easy to create "conflicting" jobs: a stale job that was left behind blocks all progress, because the new job can't be created until the old one is deleted.
This just clears all jobs on startup. I know there is a hook that ships with Ironic and is therefore preferable to adding code here, but running it here makes it run at the right time, when we need it to.
We also add an option to call the Ironic "known_good_state" step, which reboots the iDRAC, just to make sure. Unfortunately this is broken in our Ironic because it relies on the "ping" binary (like we are still in 2006), which is not present in our container. It still resets the iDRAC, but then it errors out, forcing the user to unset maintenance mode, wait a while, and try again.
(2) If we are to update BIOS settings, we go ahead and take a second reboot, to ensure the BIOS settings change gets committed before we take on anything else. This takes a very long time, but trying to stack the BIOS update alongside other updates seems to result in failures. This change makes the process slower, but more likely to work the first time.
(3) Disable some more PXE devices. These can get left enabled by prior processes and confuse the boot process output. We're not using PXE, so disable it, period.
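As a rough sketch of the opt-in behaviour described in (1): the function below picks which management step to request, using the same step-dictionary shape that appears in the Ironic error above. The function name `build_startup_steps` and the `reset_bmc` flag are hypothetical, and the step names `clear_job_queue` and `known_good_state` on the `management` interface are assumptions about the iDRAC driver in the deployed Ironic version, so check them against that deployment before relying on this.

```python
def build_startup_steps(reset_bmc: bool = False) -> list[dict]:
    """Return the clean steps to run before enrolment does anything else.

    Assumption: the iDRAC management interface exposes steps named
    "clear_job_queue" and "known_good_state".
    """
    if reset_bmc:
        # Slow path, only on explicit request: reset the BMC as well.
        return [{"interface": "management", "step": "known_good_state"}]
    # Default path: only drop stale jobs so new jobs can be created.
    return [{"interface": "management", "step": "clear_job_queue"}]
```

Keeping the reset behind a flag matches the plan in the conversation above: the cheap job-queue clear always runs, and the much slower BMC reset runs only when asked for.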