Commit 01ae5bd

fix(mkdir): PL + Hydra with DDP makes runtime.dir not follow the default; hard-coded for now.
Track the related upstream Hydra issue: facebookresearch/hydra#2070. docs(readme): fix #2 about the missing hyperlink in the dataprocess commands.
1 parent 76f25ab commit 01ae5bd

5 files changed: 18 additions & 11 deletions


1_train.py (8 additions & 4 deletions)

````diff
@@ -3,7 +3,8 @@
 # Copyright (C) 2023-now, RPL, KTH Royal Institute of Technology
 # Author: Qingwen Zhang (https://kin-zhang.github.io/)
 #
-# This file is part of DeFlow (https://github.com/KTH-RPL/DeFlow).
+# This file is part of DeFlow (https://github.com/KTH-RPL/DeFlow) and
+# SeFlow (https://github.com/KTH-RPL/SeFlow) projects.
 # If you find this repo helpful, please cite the respective publication as
 # listed on the above website.
@@ -47,17 +48,20 @@ def main(cfg):
                             collate_fn=collate_fn_pad,
                             pin_memory=True)
 
-
     # count gpus, overwrite gpus
     cfg.gpus = torch.cuda.device_count() if torch.cuda.is_available() else 0
 
-    # only for logging on folder name.
+    output_dir = HydraConfig.get().runtime.output_dir
+    # overwrite logging folder name for SSL.
     if cfg.loss_fn == 'seflowLoss':
         cfg.output = cfg.output.replace(cfg.model.name, "seflow")
+        output_dir = output_dir.replace(cfg.model.name, "seflow")
         method_name = "seflow"
     else:
         method_name = cfg.model.name
-    output_dir = HydraConfig.get().runtime.output_dir + f"/{cfg.output}"
+
+    # FIXME: hydra output_dir with ddp run will mkdir in the parent folder. Looks like PL and Hydra trying to fix in lib.
+    # print(f"Output Directory: {output_dir} in gpu rank: {torch.cuda.current_device()}")
     Path(os.path.join(output_dir, "checkpoints")).mkdir(parents=True, exist_ok=True)
 
     cfg = DictConfig(OmegaConf.to_container(cfg, resolve=True))
````
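The logic of this hunk can be isolated outside of Hydra: rename the model part of the run folder when the self-supervised loss is selected, then create the checkpoint directory in a way that is safe when several DDP ranks reach it concurrently. A minimal hypothetical sketch (not the repo's exact code; the path and names are made up):

```python
from pathlib import Path
import tempfile

def resolve_output_dir(runtime_output_dir: str, model_name: str, loss_fn: str) -> str:
    # For SSL (seflowLoss), rename the model part of the folder to "seflow",
    # mirroring the replace() done on Hydra's runtime.output_dir.
    if loss_fn == "seflowLoss":
        return runtime_output_dir.replace(model_name, "seflow")
    return runtime_output_dir

def make_checkpoint_dir(output_dir: str) -> Path:
    # parents=True + exist_ok=True keeps this safe when multiple
    # DDP ranks call it at (nearly) the same time.
    ckpt = Path(output_dir) / "checkpoints"
    ckpt.mkdir(parents=True, exist_ok=True)
    return ckpt

base = tempfile.mkdtemp()
out = resolve_output_dir(f"{base}/logs/jobs/deflow/07-16-10-30", "deflow", "seflowLoss")
ckpt = make_checkpoint_dir(out)
print(ckpt.is_dir())  # → True
```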

README.md (7 additions & 4 deletions)

````diff
@@ -80,7 +80,7 @@ Note: Prepare raw data and process train data only needed run once for the task.
 ### Data Preparation
 
 Check [dataprocess/README.md](dataprocess/README.md#argoverse-20) for downloading tips for the raw Argoverse 2 dataset. Or maybe you want to have the **mini processed dataset** to try the code quickly, We directly provide one scene inside `train` and `val`. It already converted to `.h5` format and processed with the label data.
-You can download it from [Zenodo](https://zenodo.org/record/12751363) and extract it to the data folder. And then you can skip following steps and directly run the [training script](#train-the-model).
+You can download it from [Zenodo](https://zenodo.org/records/12751363/files/demo_data.zip) and extract it to the data folder. And then you can skip following steps and directly run the [training script](#train-the-model).
 
 ```bash
 wget https://zenodo.org/record/12751363/files/demo_data.zip
@@ -89,7 +89,8 @@ unzip demo_data.zip -p /home/kin/data/av2
 
 #### Prepare raw data
 
-Extract all data to unified h5 format. [Runtime: Normally need 10 mins finished run following commands totally in my desktop, 45 mins for the cluster I used]
+Checking more information (download raw data etc) in [dataprocess/README.md](dataprocess/README.md). Extract all data to unified h5 format.
+[Runtime: Normally need 10 mins finished run following commands totally in my desktop, 45 mins for the cluster I used]
 ```bash
 python dataprocess/extract_av2.py --av2_type sensor --data_mode train --argo_dir /home/kin/data/av2 --output_dir /home/kin/data/av2/preprocess_v2
 python dataprocess/extract_av2.py --av2_type sensor --data_mode val --mask_dir /home/kin/data/av2/3d_scene_flow
@@ -122,8 +123,10 @@ python 1_train.py model=fastflow3d lr=2e-4 epochs=20 batch_size=16 loss_fn=ff3dL
 python 1_train.py model=deflow lr=2e-4 epochs=20 batch_size=16 loss_fn=deflowLoss
 ```
 
-Note: You may found the different settings in the paper that is all methods are enlarge learning rate to 2e-4 and decrease the epochs to 20 for faster converge (Through analysis, we also found it had better performance).
-However, we kept the setting on lr=2e-6 and 50 epochs in the paper experiment for the fair comparison with ZeroFlow where we directly use their provided weights.
+> [!NOTE]
+> You may found the different settings in the paper that is all methods are enlarge learning rate to 2e-4 and decrease the epochs to 20 for faster converge and better performance.
+> However, we kept the setting on lr=2e-6 and 50 epochs in (SeFlow & DeFlow) paper experiments for the fair comparison with ZeroFlow where we directly use their provided weights.
+> We suggest afterward researchers or users to use the setting here (larger lr and smaller epoch) for faster converge and better performance.
 
 ## 2. Evaluation
 
````

conf/config.yaml (1 addition & 1 deletion)

```diff
@@ -28,7 +28,7 @@ gradient_clip_val: 5.0
 # optimizer ==> Adam
 lr: 2e-6
 loss_fn: seflowLoss # choices: [ff3dLoss, zeroflowLoss, deflowLoss, seflowLoss]
-add_seloss:
+add_seloss: # {chamfer_dis: 1.0, static_flow_loss: 1.0, dynamic_chamfer_dis: 1.0, cluster_based_pc0pc1: 1.0}
 
 # log settings
 seed: 42069
```
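The commented `add_seloss` entry reads like a dict of per-term weights for the SSL loss. A hypothetical sketch of how such a weight dict could be folded into one scalar loss (the term names follow the comment in conf/config.yaml; the loss values here are made-up numbers, not from the repo):

```python
# Weights as in the add_seloss comment.
weights = {
    "chamfer_dis": 1.0,
    "static_flow_loss": 1.0,
    "dynamic_chamfer_dis": 1.0,
    "cluster_based_pc0pc1": 1.0,
}

def weighted_total(terms: dict, weights: dict) -> float:
    # Sum w_i * loss_i over the terms named in the weight dict.
    return sum(weights[name] * terms[name] for name in weights)

# Made-up per-term loss values, just to exercise the sum.
terms = {
    "chamfer_dis": 0.5,
    "static_flow_loss": 0.2,
    "dynamic_chamfer_dis": 0.1,
    "cluster_based_pc0pc1": 0.4,
}
print(round(weighted_total(terms, weights), 6))  # → 1.2
```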

conf/hydra/default.yaml (1 addition & 1 deletion)

```diff
@@ -1,2 +1,2 @@
 run:
-  dir: logs/wandb
+  dir: logs/jobs/${output}/${now:%m-%d-%H-%M}
```
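Hydra's `${now:%m-%d-%H-%M}` resolver formats the launch time with strftime codes, so each run lands in a timestamped folder under `logs/jobs/<output>/`. A small sketch of what the new pattern expands to, reproduced with plain `datetime` formatting (the `output` value here is made up):

```python
from datetime import datetime

def run_dir(output: str, now: datetime) -> str:
    # Mirrors logs/jobs/${output}/${now:%m-%d-%H-%M} from conf/hydra/default.yaml.
    return f"logs/jobs/{output}/{now.strftime('%m-%d-%H-%M')}"

print(run_dir("deflow-e20", datetime(2024, 7, 16, 10, 30)))
# → logs/jobs/deflow-e20/07-16-10-30
```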

dataprocess/README.md (1 addition & 1 deletion)

````diff
@@ -34,7 +34,7 @@ s5cmd --no-sign-request cp "s3://argoverse/datasets/av2/sensor/test/*" sensor/te
 s5cmd --no-sign-request cp "s3://argoverse/tasks/3d_scene_flow/zips/*" .
 ```
 
-Then to quickly pre-process the data, we can [read more detail](../preprocess/README.md) on how to generate the pre-processed data for training and evaluation. This will take around 2 hour for the whole dataset (train & val) based on how powerful your CPU is.
+Then to quickly pre-process the data, we can [read these commands](#process) on how to generate the pre-processed data for training and evaluation. This will take around 0.5-2 hour for the whole dataset (train & val) based on how powerful your CPU is.
 
 More [self-supervised data in AV2 LiDAR only](https://www.argoverse.org/av2.html#lidar-link), note: It **does not** include **imagery or 3D annotations**. The dataset is designed to support research into self-supervised learning in the lidar domain, as well as point cloud forecasting.
 ```bash
````
