Skip to content

Commit e2dcf90

Browse files
committed
xen/xenbus: better handle backend crash
When the backend domain crashes, coordinated device cleanup is not possible (as it involves waiting for the backend state change). In that case, toolstack forcefully removes frontend xenstore entries. xenbus_dev_changed() handles this case, and triggers device cleanup. It's possible that toolstack manages to connect new device in that place, before xenbus_dev_changed() notices the old one is missing. If that happens, new one won't be probed and will forever remain in XenbusStateInitialising. Fix this by checking the frontend's state in Xenstore. In case it has been reset to XenbusStateInitialising by Xen tools, consider this being the result of an unplug+plug operation. It's important that cleanup on such unplug doesn't modify Xenstore entries (especially the "state" key) as it belong to the new device to be probed - changing it would derail establishing connection to the new backend (most likely, closing the device before it was even connected). Handle this case by setting new xenbus_device->vanished flag to true, and check it before changing state entry. And even if xenbus_dev_changed() correctly detects the device was forcefully removed, the cleanup handling is still racy. Since this whole handling doesn't happened in a single Xenstore transaction, it's possible that toolstack might put a new device there already. Avoid re-creating the state key (which in the case of loosing the race would actually close newly attached device). The problem does not apply to frontend domain crash, as this case involves coordinated cleanup. Problem originally reported at https://lore.kernel.org/xen-devel/aOZvivyZ9YhVWDLN@mail-itl/T/#t, including reproduction steps. Based-on-patch-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Juergen Gross <jgross@suse.com> Message-ID: <20260218095205.453657-3-jgross@suse.com>
1 parent 82169da commit e2dcf90

3 files changed

Lines changed: 48 additions & 2 deletions

File tree

drivers/xen/xenbus/xenbus_client.c

Lines changed: 11 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -226,8 +226,9 @@ __xenbus_switch_state(struct xenbus_device *dev,
226226
struct xenbus_transaction xbt;
227227
int current_state;
228228
int err, abort;
229+
bool vanished = false;
229230

230-
if (state == dev->state)
231+
if (state == dev->state || dev->vanished)
231232
return 0;
232233

233234
again:
@@ -242,6 +243,10 @@ __xenbus_switch_state(struct xenbus_device *dev,
242243
err = xenbus_scanf(xbt, dev->nodename, "state", "%d", &current_state);
243244
if (err != 1)
244245
goto abort;
246+
if (current_state != dev->state && current_state == XenbusStateInitialising) {
247+
vanished = true;
248+
goto abort;
249+
}
245250

246251
err = xenbus_printf(xbt, dev->nodename, "state", "%d", state);
247252
if (err) {
@@ -256,7 +261,7 @@ __xenbus_switch_state(struct xenbus_device *dev,
256261
if (err == -EAGAIN && !abort)
257262
goto again;
258263
xenbus_switch_fatal(dev, depth, err, "ending transaction");
259-
} else
264+
} else if (!vanished)
260265
dev->state = state;
261266

262267
return 0;
@@ -941,6 +946,10 @@ enum xenbus_state xenbus_read_driver_state(const struct xenbus_device *dev,
941946
const char *path)
942947
{
943948
enum xenbus_state result;
949+
950+
if (dev && dev->vanished)
951+
return XenbusStateUnknown;
952+
944953
int err = xenbus_gather(XBT_NIL, path, "state", "%d", &result, NULL);
945954
if (err)
946955
result = XenbusStateUnknown;

drivers/xen/xenbus/xenbus_probe.c

Lines changed: 36 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -444,6 +444,9 @@ static void xenbus_cleanup_devices(const char *path, struct bus_type *bus)
444444
info.dev = NULL;
445445
bus_for_each_dev(bus, NULL, &info, cleanup_dev);
446446
if (info.dev) {
447+
dev_warn(&info.dev->dev,
448+
"device forcefully removed from xenstore\n");
449+
info.dev->vanished = true;
447450
device_unregister(&info.dev->dev);
448451
put_device(&info.dev->dev);
449452
}
@@ -659,6 +662,39 @@ void xenbus_dev_changed(const char *node, struct xen_bus_type *bus)
659662
return;
660663

661664
dev = xenbus_device_find(root, &bus->bus);
665+
/*
666+
* Backend domain crash results in not coordinated frontend removal,
667+
* without going through XenbusStateClosing. If this is a new instance
668+
* of the same device Xen tools will have reset the state to
669+
* XenbusStateInitializing.
670+
* It might be that the backend crashed early during the init phase of
671+
* device setup, in which case the known state would have been
672+
* XenbusStateInitializing. So test the backend domid to match the
673+
* saved one. In case the new backend happens to have the same domid as
674+
* the old one, we can just carry on, as there is no inconsistency
675+
* resulting in this case.
676+
*/
677+
if (dev && !strcmp(bus->root, "device")) {
678+
enum xenbus_state state = xenbus_read_driver_state(dev, dev->nodename);
679+
unsigned int backend = xenbus_read_unsigned(root, "backend-id",
680+
dev->otherend_id);
681+
682+
if (state == XenbusStateInitialising &&
683+
(state != dev->state || backend != dev->otherend_id)) {
684+
/*
685+
* State has been reset, assume the old one vanished
686+
* and new one needs to be probed.
687+
*/
688+
dev_warn(&dev->dev,
689+
"state reset occurred, reconnecting\n");
690+
dev->vanished = true;
691+
}
692+
if (dev->vanished) {
693+
device_unregister(&dev->dev);
694+
put_device(&dev->dev);
695+
dev = NULL;
696+
}
697+
}
662698
if (!dev)
663699
xenbus_probe_node(bus, type, root);
664700
else

include/xen/xenbus.h

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -80,6 +80,7 @@ struct xenbus_device {
8080
const char *devicetype;
8181
const char *nodename;
8282
const char *otherend;
83+
bool vanished;
8384
int otherend_id;
8485
struct xenbus_watch otherend_watch;
8586
struct device dev;

0 commit comments

Comments
 (0)