Conversation
a-nogikh
left a comment
(I haven't reviewed all the commits yet.)
Please send the first commit as a separate PR, we'll merge it right away.
Regarding syz-cluster integration: there's one thing we need to double-check - bootable disk images. For qemu, we only need a generic disk image (the buildroot one) + the kernel image (the bzImage file). And we definitely do pass the kernel image file on from the build step to the boot/fuzz/retest actions, and the syz-manager configs correctly reference it.
For gce, a kernel image must be incorporated into the disk image (/pkg/build takes care of that). But it means that we should take the image out of pkg/build.Image and make sure that it's also passed on to the other actions, and that all gce manager configs reference it and not the generic buildroot image.
Another action item here is to make sure that boot/fuzz/test pods (to be more precise, the service accounts they use) have the permissions to use the GCS and GCE APIs.
Even better (but not critical, we can skip it), if they only have these permissions if VM type is gce.
One more thing is that it makes no sense to request a lot of CPUs and RAM for the pods with gce VM type (afaik we set it in the argo workflow templates). Also, such pods do not have to be scheduled on the pool of machines with nested virtualization support:
syzkaller/syz-cluster/overlays/gke/common/kustomization.yaml, lines 13 to 32 in 86914af
Yet another point to take care of is that syz-manager names now become important, as they will be used for the GCE images that get created and as a prefix for the transient fuzzing VMs. They must be unique.
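Since the manager name feeds into GCE resource names, it also has to be valid there: GCE names must be RFC1035-style labels (start with a lowercase letter; only lowercase letters, digits, and hyphens; at most 63 characters). A minimal sketch of a sanitizer; `gceImageName` is a hypothetical helper, not an existing syzkaller function:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var invalidChars = regexp.MustCompile(`[^a-z0-9-]`)

// gceImageName derives an RFC1035-compliant GCE image name from a
// syz-manager name plus a per-image suffix. GCE resource names must start
// with a lowercase letter, contain only lowercase letters, digits and
// hyphens, and be at most 63 characters long.
func gceImageName(manager, suffix string) string {
	name := invalidChars.ReplaceAllString(strings.ToLower(manager), "-")
	name = strings.Trim(name, "-")
	if name == "" || name[0] >= '0' && name[0] <= '9' {
		name = "img-" + name
	}
	full := name + "-" + suffix
	if len(full) > 63 {
		// Truncate the manager part, keeping the distinguishing suffix.
		full = strings.Trim(full[:63-len(suffix)-1], "-") + "-" + suffix
	}
	return full
}

func main() {
	fmt.Println(gceImageName("ci-upstream", "image-0"))
}
```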
| SkipCoverCheck bool `json:"skip_cover_check"` | ||
| // Only report the bugs that match the regexp. | ||
| BugTitleRe string `json:"bug_title_re"` | ||
| VMType string `json:"vm_type"` |
**Thinking out loud**
The structure has become a bit too broad - it has both the fields that we need for the syz-manager config (Focus, VMType, later maybe Arch as well) and the fields we only need for focused fuzzing (CorpusURLs, etc).
Maybe we need a separate structure like SyzkallerConfig or EnvConfig to only include the information truly relevant for running a syz-manager --mode=smoke or pkg/manager/diff. And then nest it into both FuzzConfig (which can become FuzzTask) and RetestTask.
Wdyt?
| var qemuBaseConfigJSON []byte | ||
| | ||
| //go:embed gce_base.cfg | ||
| var gceBaseConfigJSON []byte |
Would be nice to have at least some gce tests (see testdata).
Please split this off. (And note that we typically use shorter commit titles.)
Force-pushed from 1ea3358 to c4d285c
Force-pushed from 1f168b8 to ab1b8e2
Create a new config option `auto_delete_image` for GCE VMs that signals an image should be deleted when it's no longer used.
Propagate VM type configuration throughout the syz-cluster pipeline to allow fuzzing on GCE in addition to QEMU.
Update Argo workflow templates to pass the full fuzzing configuration as an artifact to boot and retest actions.
We don't need very performant VMs since we will switch to GCE fuzzing. Introduce two profiles and choose based on VM type.
@a-nogikh PTAL. I still need to:
|
a-nogikh
left a comment
One random idea (idk if it's better or not).
Instead of random names, we could maintain a pool of reusable images and GCS files. E.g. if a manager is named ci-upstream, it would look up ci-upstream-image-0, ci-upstream-image-1 and check their last_active tags.
If the GCE API allows detecting concurrent label updates (so that if two managers change labels concurrently, only one wins), we can let them contend for the images.
In that case, we wouldn't really need to delete any other images, only the one we've managed to grab (?)
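The GCE API does appear to support this: `images.setLabels` takes a `labelFingerprint`, and a request with a stale fingerprint fails, which gives compare-and-swap semantics. A sketch of the contention logic against a small interface; the interface, the fake, and all names are illustrative, not the real Compute client:

```go
package main

import (
	"errors"
	"fmt"
)

// labelAPI abstracts the GCE images.setLabels call, which in the real API
// takes a labelFingerprint and fails if the fingerprint is stale.
type labelAPI interface {
	// Fingerprint returns the current label fingerprint of the image.
	Fingerprint(image string) (string, error)
	// SetLabels applies labels only if fingerprint is still current.
	SetLabels(image, fingerprint string, labels map[string]string) error
}

var errStaleFingerprint = errors.New("conditionNotMet: label fingerprint is stale")

// grabImage tries to mark an image as ours by updating its last_active
// label. If another manager won the race (stale fingerprint), it re-reads
// and retries; exactly one concurrent caller succeeds per round.
func grabImage(api labelAPI, image, owner string) error {
	for attempt := 0; attempt < 3; attempt++ {
		fp, err := api.Fingerprint(image)
		if err != nil {
			return err
		}
		err = api.SetLabels(image, fp, map[string]string{"last_active": owner})
		if err == nil {
			return nil
		}
		if !errors.Is(err, errStaleFingerprint) {
			return err
		}
	}
	return fmt.Errorf("image %s: lost the label race", image)
}

// fakeAPI is an in-memory stand-in to exercise the retry logic.
type fakeAPI struct {
	fp     string
	labels map[string]string
}

func (f *fakeAPI) Fingerprint(string) (string, error) { return f.fp, nil }

func (f *fakeAPI) SetLabels(_, fp string, labels map[string]string) error {
	if fp != f.fp {
		return errStaleFingerprint
	}
	f.fp += "'" // the real API changes the fingerprint on every update
	f.labels = labels
	return nil
}

func main() {
	api := &fakeAPI{fp: "v1"}
	if err := grabImage(api, "ci-upstream-image-0", "manager-a"); err != nil {
		panic(err)
	}
	fmt.Println(api.labels["last_active"])
}
```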
| requests: | ||
| cpu: 24 | ||
| memory: 90G | ||
| cpu: "{{=inputs.parameters.profile == 'efficient' ? '2' : '24'}}" |
Nit: it's not really about efficiency :)
Let's maybe just make it gce / qemu?
In this case we also should not demand that the pod is scheduled specifically on the node pool with nested vm support.
Same for the retest and boot steps.
I wanted to decouple the vm type from the machine capabilities, only using it in the first stage and propagating it further. What do you think of profile names like "low" and "high"?
I'm also not very fixated on decoupling, so if you prefer, I'll go with qemu/gce.
| "vm": { | ||
| "count": 3, | ||
| "machine_type": "e2-standard-2", | ||
| "gcs_path": "syzkaller/disks", |
This path should be configurable somewhere, as it depends on the used GCP project.
| return qemuBaseConfigJSON | ||
| } | ||
| | ||
| func ReadFuzzConfig(config string) *api.FuzzConfig { |
Shall we then also switch fuzz and retest to use it? fuzz still has readFuzzConfig.
Nit: I think it could be called ReadFromFile as it's evident from the package name that it's about fuzz config.
| var err error | ||
| mgrCfg.VM, err = config.MergeJSONs(mgrCfg.VM, []byte( | ||
| `{"qemu_args": "-machine q35,nvdimm=on,accel=kvm,kernel-irqchip=split -cpu max,migratable=off -enable-kvm -smp 2,sockets=2,cores=1"}`)) | ||
| if mgrCfg.Type == "qemu" { |
For gce, we may want to change the machine type from e2 to some that supports nested virtualization. AFAIK it's either n2 or n2d (needs double-checking).
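A sketch of such an override, with a local `mergeJSONs` standing in for syzkaller's `config.MergeJSONs`; whether n2 is the right machine family still needs double-checking, as noted:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// mergeJSONs is a local stand-in for syzkaller's config.MergeJSONs:
// top-level values from patch override values in base.
func mergeJSONs(base, patch []byte) ([]byte, error) {
	m := map[string]interface{}{}
	if err := json.Unmarshal(base, &m); err != nil {
		return nil, err
	}
	p := map[string]interface{}{}
	if err := json.Unmarshal(patch, &p); err != nil {
		return nil, err
	}
	for k, v := range p {
		m[k] = v
	}
	return json.Marshal(m)
}

func main() {
	base := []byte(`{"count": 3, "machine_type": "e2-standard-2"}`)
	// e2 machines don't support nested virtualization; switch to a family
	// that does (n2 or n2d, per the review, pending verification).
	merged, err := mergeJSONs(base, []byte(`{"machine_type": "n2-standard-2"}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(merged))
}
```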
| return nil, | ||
| SkipError(fmt.Sprintf("only gce and qemu vms are supported now. %s is not supported", target.FuzzConfig.VMType)) | ||
| } | ||
| if result.Arch != "amd64" { |
Maybe move it to https://github.com/google/syzkaller/blob/master/syz-cluster/pkg/app/config.go's validation checks?
The arch ultimately comes from Trees/FuzzTargets anyway.
Would be very nice to see some gce tests in testdata/.
| return "", err | ||
| } | ||
| | ||
| // Attempt 3 times to update the label. |
I'm hesitating whether we want to renew the image label on VM creation.
In syz-manager, you could pause and resume its execution:
Line 137 in d9b7f62
If the manager is paused for longer than VMRunningTime, the session will break.
| } | ||
| | ||
| func (ctx *Context) DeleteImageWithRequestID(imageName, requestID string) error { | ||
| // A request ID ensures that if two managers send the same request, deletion will only be attempted once. |
What error would we get in that case? If one manager gets success and the other gets something like "operation in progress" or "not found", maybe it's okay to just tolerate those errors?
| "vm": { | ||
| "count": 3, | ||
| "machine_type": "e2-standard-2", | ||
| "gcs_path": "syzkaller/disks", |
One potential problem to double-check: AFAIK we first upload the image to the GCS bucket at gcs_path. Do we use the manager's name for the file?
In that case, we could accumulate a lot of garbage there.
Before sending a pull request, please review Contribution Guidelines:
https://github.com/google/syzkaller/blob/master/docs/contributing.md