
Support fuzzing on GCE#6905

Draft
pimyn-girgis wants to merge 8 commits into google:master from pimyn-girgis:gce

Conversation

@pimyn-girgis
Collaborator


Before sending a pull request, please review Contribution Guidelines:
https://github.com/google/syzkaller/blob/master/docs/contributing.md


@pimyn-girgis pimyn-girgis requested a review from a-nogikh March 10, 2026 15:54
Collaborator

@a-nogikh a-nogikh left a comment


(Not all commits reviewed yet)

Please send the first commit as a separate PR, we'll merge it right away.


Regarding syz-cluster integration: there's one thing we need to double-check - bootable disk images. For qemu, we only need a generic disk image (the buildroot one) + the kernel image (the bzImage file). And we definitely do pass the kernel image file from the build step to the boot/fuzz/retest actions + the syz-manager configs correctly reference it.

For gce, the kernel image must be incorporated into the disk image (/pkg/build takes care of that). But that means we should take the image out of pkg/build.Image and make sure that it's also passed on to the other actions, and that all gce manager configs reference it and not the generic buildroot image.


Another action item here is to make sure that boot/fuzz/test pods (to be more precise, the service accounts they use) have the permissions to use the GCS and GCE APIs.

Even better (though not critical, we can skip it) if they only have these permissions when the VM type is gce.

One more thing: it makes no sense to request a lot of CPUs and RAM for the pods with the gce VM type (afaik we set it in the argo workflow templates). Also, such pods do not have to stay on the pool of machines with nested virtualization support:

```yaml
patches:
- target:
    kind: WorkflowTemplate
    name: (boot|fuzz)-action-template
  patch: |-
    - op: replace
      path: /spec/templates/0/tolerations
      value:
      - key: "workload"
        operator: "Equal"
        value: "nested-vm"
        effect: "NoSchedule"
- target:
    kind: WorkflowTemplate
    name: (boot|fuzz)-action-template
  patch: |-
    - op: replace
      path: /spec/templates/0/nodeSelector
      value:
        amd64-nested-virtualization: "true"
```


Yet another point to take care of: syz-manager names now begin to matter, as they will be used for the GCE images that are created and as a prefix for the transient fuzzing VMs. They must be unique.

Comment thread syz-cluster/workflow/boot/main.go Outdated
Comment thread syz-cluster/workflow/build/main.go Outdated
Comment thread syz-cluster/pkg/api/api.go
```go
SkipCoverCheck bool `json:"skip_cover_check"`
// Only report the bugs that match the regexp.
BugTitleRe string `json:"bug_title_re"`
VMType string `json:"vm_type"`
```
Collaborator


**Thinking out loud**

The structure has become a bit too broad - it has both the fields that we need for the syz-manager config (Focus, VMType, later maybe Arch as well) and the fields we only need for focused fuzzing (CorpusURLs, etc).

Maybe we need a separate structure like SyzkallerConfig or EnvConfig that only includes the information truly relevant for running syz-manager --mode=smoke or pkg/manager/diff, and then nest it into both FuzzConfig (which can become FuzzTask) and RetestTask.

Wdyt?

Comment thread syz-cluster/workflow/retest/main.go Outdated
Comment thread syz-cluster/workflow/triage/main.go Outdated
Comment thread syz-cluster/pkg/fuzzconfig/gce_base.cfg Outdated
```go
var qemuBaseConfigJSON []byte

//go:embed gce_base.cfg
var gceBaseConfigJSON []byte
```
Collaborator


Would be nice to have at least some gce tests (see testdata).

@a-nogikh
Collaborator

Please split off "pkg/gce, vm/gce: auto delete instances if VMRunningTime is set in config" to a separate commit. It would be very nice to test it locally before we merge it, e.g. by setting up a syzkaller instance on GCP and letting it run for some time.

(And note that we typically use shorter commit titles)

Create a new config option `auto_delete_image` for GCE VMs, that signals an image should be deleted when it's no longer used.
Propagate VM type configuration throughout the syz-cluster pipeline to allow fuzzing on GCE in addition to QEMU.
Update Argo workflow templates to pass the full fuzzing configuration as an artifact to boot and retest actions.
We don't need very performant VMs since we will switch to GCE fuzzing. Introduce two profiles and choose based on VM type.
@pimyn-girgis
Collaborator Author

@a-nogikh PTAL

I still need to:

  1. check that SAs have permission to use required GCS and GCE APIs.
  2. check that bootable disk images are referenced correctly
  3. add gce tests for fuzzconfig/generate.go
  4. address the comment about separating structs

Collaborator

@a-nogikh a-nogikh left a comment


One random idea (idk if it's better or not).

Instead of random names, we could maintain a pool of reusable images and GCS files. E.g. if a manager is named ci-upstream, it would look up ci-upstream-image-0, ci-upstream-image-1 and check their last_active tags.

If the GCE API allows detecting concurrent label updates (so that if two managers change labels concurrently, only one wins), we can let them contest for the images.

In that case, we wouldn't really need to delete any other images, only the one we've managed to grab (?)

```yaml
requests:
  cpu: 24
  memory: 90G
cpu: "{{=inputs.parameters.profile == 'efficient' ? '2' : '24'}}"
```
Collaborator


Nit: it's not really about efficiency :)
Maybe just let it be gce / qemu?

In this case we also should not demand that the pod is scheduled specifically on the node pool with nested vm support.

Same for the retest and boot steps.

Collaborator Author


I wanted to decouple the vm type from the machine capabilities, only using it in the first stage and propagating it further. What do you think of profile names like "low" and "high"?
I'm also not very fixated on decoupling, so if you prefer, I'll go with qemu/gce.

```json
"vm": {
    "count": 3,
    "machine_type": "e2-standard-2",
    "gcs_path": "syzkaller/disks",
```
Collaborator


This path should be configurable somewhere, as it depends on the used GCP project.

```go
	return qemuBaseConfigJSON
}

func ReadFuzzConfig(config string) *api.FuzzConfig {
```
Collaborator


Shall we then also switch fuzz and retest to use it? fuzz still has readFuzzConfig.

Nit: I think it could be called ReadFromFile as it's evident from the package name that it's about fuzz config.

```go
var err error
mgrCfg.VM, err = config.MergeJSONs(mgrCfg.VM, []byte(
	`{"qemu_args": "-machine q35,nvdimm=on,accel=kvm,kernel-irqchip=split -cpu max,migratable=off -enable-kvm -smp 2,sockets=2,cores=1"}`))
if mgrCfg.Type == "qemu" {
```
Collaborator


For gce, we may want to change the machine type from e2 to one that supports nested virtualization. AFAIK it's either n2 or n2d (needs double-checking).

```go
return nil,
	SkipError(fmt.Sprintf("only gce and qemu vms are supported now. %s is not supported", target.FuzzConfig.VMType))
}
if result.Arch != "amd64" {
```
Collaborator


Maybe move it to https://github.com/google/syzkaller/blob/master/syz-cluster/pkg/app/config.go's validation checks?
The arch ultimately comes from Trees/FuzzTargets anyway.

Collaborator


Would be very nice to see some gce tests in testdata/.

Comment thread pkg/gce/gce.go
```go
	return "", err
}

// Attempt 3 times to update the label.
```
Collaborator


I'm hesitating whether we want to renew the image label on VM creation.

In syz-manager, you could pause and resume its execution:

```go
case "pause":
```

If the manager is paused for longer than VMRunningTime, the session will break.

Comment thread pkg/gce/gce.go
```go
}

func (ctx *Context) DeleteImageWithRequestID(imageName, requestID string) error {
	// A request ID ensures that if two managers send the same request, deletion will only be attempted once.
```
Collaborator


What error would we get in that case? If one manager gets success and the other something like "operation in progress" or "not found", maybe it's okay to just tolerate those errors?

```json
"vm": {
    "count": 3,
    "machine_type": "e2-standard-2",
    "gcs_path": "syzkaller/disks",
```
Collaborator


One potential problem to double-check: AFAIK we first upload the image to the GCS bucket at gcs_path. Do we use the manager's name for the file?

In that case, we could accumulate a lot of garbage there.

