
Support fuzzing on GCE#6905

Draft
pimyn-girgis wants to merge 8 commits into google:master from pimyn-girgis:gce

Conversation

@pimyn-girgis
Collaborator


Before sending a pull request, please review Contribution Guidelines:
https://github.com/google/syzkaller/blob/master/docs/contributing.md


@pimyn-girgis pimyn-girgis requested a review from a-nogikh March 10, 2026 15:54
Collaborator

@a-nogikh a-nogikh left a comment


(Not all commits reviewed yet)

Please send the first commit as a separate PR, we'll merge it right away.


Regarding syz-cluster integration: there's one thing we need to double-check - bootable disk images. For qemu, we only need a generic disk image (the buildroot one) + the kernel image (the bzImage file). And we definitely do pass the kernel image file from the build step to the boot/fuzz/retest actions + the syz-manager configs correctly reference it.

For gce, the kernel image must be incorporated into the disk image (/pkg/build takes care of that). But that means we should take the image out of pkg/build.Image and make sure that it's also passed on to the other actions, and that all gce manager configs reference it and not the generic buildroot image.


Another action item here is to make sure that boot/fuzz/test pods (to be more precise, the service accounts they use) have the permissions to use the GCS and GCE APIs.

Even better (though not critical, we can skip it) if they only have these permissions when the VM type is gce.

One more thing: it makes no sense to request a lot of CPUs and RAM for the pods with the gce VM type (afaik we set it in the argo workflow templates). Also, such pods do not have to stay on the pool of machines with nested virtualization support:

```yaml
patches:
- target:
    kind: WorkflowTemplate
    name: (boot|fuzz)-action-template
  patch: |-
    - op: replace
      path: /spec/templates/0/tolerations
      value:
      - key: "workload"
        operator: "Equal"
        value: "nested-vm"
        effect: "NoSchedule"
- target:
    kind: WorkflowTemplate
    name: (boot|fuzz)-action-template
  patch: |-
    - op: replace
      path: /spec/templates/0/nodeSelector
      value:
        amd64-nested-virtualization: "true"
```


Yet another point to take care of: syz-manager names now begin to matter, as they will be used for the GCE images that are created and as a prefix for the transient fuzzing VMs. They must be unique.

Comment thread syz-cluster/workflow/boot/main.go Outdated
Comment thread syz-cluster/workflow/build/main.go Outdated
Comment thread syz-cluster/pkg/api/api.go
```go
SkipCoverCheck bool `json:"skip_cover_check"`
// Only report the bugs that match the regexp.
BugTitleRe string `json:"bug_title_re"`
VMType string `json:"vm_type"`
```
Collaborator


**Thinking out loud**

The structure has become a bit too broad - it has both the fields that we need for the syz-manager config (Focus, VMType, later maybe Arch as well) and the fields we only need for focused fuzzing (CorpusURLs, etc).

Maybe we need a separate structure like SyzkallerConfig or EnvConfig that only includes the information truly relevant for running syz-manager --mode=smoke or pkg/manager/diff, and then nest it into both FuzzConfig (which can become FuzzTask) and RetestTask.

Wdyt?

Comment thread syz-cluster/workflow/retest/main.go Outdated
Comment thread syz-cluster/workflow/triage/main.go Outdated
Comment thread syz-cluster/pkg/fuzzconfig/gce_base.cfg Outdated
```go
var qemuBaseConfigJSON []byte

//go:embed gce_base.cfg
var gceBaseConfigJSON []byte
```
Collaborator


Would be nice to have at least some gce tests (see testdata).

@a-nogikh
Collaborator

Please split off "pkg/gce, vm/gce: auto delete instances if VMRunningTime is set in config" to a separate commit. It would be very nice to test it locally before we merge it, e.g. by setting up a syzkaller instance on GCP and letting it run for some time.

(And note that we typically use shorter commit titles)

Create a new config option `auto_delete_image` for GCE VMs, that signals an image should be deleted when it's no longer used.
Propagate VM type configuration throughout the syz-cluster pipeline to allow fuzzing on GCE in addition to QEMU.
Update Argo workflow templates to pass the full fuzzing configuration as an artifact to boot and retest actions.
We don't need very performant VMs since we will switch to GCE fuzzing. Introduce two profiles and choose based on VM type.
@pimyn-girgis
Collaborator Author

@a-nogikh PTAL

I still need to:

  1. check that SAs have permission to use required GCS and GCE APIs.
  2. check that bootable disk images are referenced correctly
  3. add gce tests for fuzzconfig/generate.go
  4. address the comment about separating structs

Collaborator

@a-nogikh a-nogikh left a comment


One random idea (idk if it's better or not).

Instead of random names, we could maintain a pool of reusable images and GCS files. E.g. if a manager is named ci-upstream, it would look up ci-upstream-image-0, ci-upstream-image-1 and check their last_active tags.

If the GCE API allows detecting concurrent label updates (so that if two managers change labels concurrently, only one wins), we can let them contest for the images.

In that case, we wouldn't really need to delete any other images, only the one we've managed to grab (?)

```yaml
requests:
  cpu: 24
  memory: 90G
cpu: "{{=inputs.parameters.profile == 'efficient' ? '2' : '24'}}"
```
Collaborator


Nit: it's not really about efficiency :)
Maybe just let it be gce / qemu?

In this case we also should not demand that the pod is scheduled specifically on the node pool with nested vm support.

Same for the retest and boot steps.

Collaborator Author


I wanted to decouple the vm type from the machine capabilities, only using it in the first stage and propagating it further. What do you think of profile names like "low" and "high"?
I'm also not very fixated on decoupling, so if you prefer, I'll go with qemu/gce.

```json
"vm": {
    "count": 3,
    "machine_type": "e2-standard-2",
    "gcs_path": "syzkaller/disks",
```
Collaborator


This path should be configurable somewhere, as it depends on the used GCP project.

```go
	return qemuBaseConfigJSON
}

func ReadFuzzConfig(config string) *api.FuzzConfig {
```
Collaborator


Shall we then also switch fuzz and retest to use it? fuzz still has readFuzzConfig.

Nit: I think it could be called ReadFromFile as it's evident from the package name that it's about fuzz config.

```go
var err error
mgrCfg.VM, err = config.MergeJSONs(mgrCfg.VM, []byte(
	`{"qemu_args": "-machine q35,nvdimm=on,accel=kvm,kernel-irqchip=split -cpu max,migratable=off -enable-kvm -smp 2,sockets=2,cores=1"}`))
if mgrCfg.Type == "qemu" {
```
Collaborator


For gce, we may want to change the machine type from e2 to one that supports nested virtualization. AFAIK it's either n2 or n2d (needs double-checking).

```go
return nil,
	SkipError(fmt.Sprintf("only gce and qemu vms are supported now. %s is not supported", target.FuzzConfig.VMType))
}
if result.Arch != "amd64" {
```
Collaborator


Maybe move it to https://github.com/google/syzkaller/blob/master/syz-cluster/pkg/app/config.go's validation checks?
The arch ultimately comes from Trees/FuzzTargets anyway.

Collaborator


Would be very nice to see some gce tests in testdata/.

Comment thread pkg/gce/gce.go
```go
	return "", err
}

// Attempt 3 times to update the label.
```
Collaborator


I'm hesitating whether we want to renew the image label on VM creation.

In syz-manager, you could pause and resume its execution:

```go
case "pause":
```

If the manager is paused for longer than VMRunningTime, the session will break.

Comment thread pkg/gce/gce.go
```go
}

func (ctx *Context) DeleteImageWithRequestID(imageName, requestID string) error {
	// A request ID ensures that if two managers send the same request, deletion will only be attempted once.
```
Collaborator


What error would we get in that case? If one manager gets success and the other something like "operation in progress" or "not found", maybe it's okay to just tolerate those errors?

```json
"vm": {
    "count": 3,
    "machine_type": "e2-standard-2",
    "gcs_path": "syzkaller/disks",
```
Collaborator


One potential problem to double-check: AFAIK we first upload the image to the GCS bucket at gcs_path. Do we use the manager's name for the file?

In that case, we could accumulate a lot of garbage there.

