distributed computing - How to run TensorFlow on a SLURM cluster with a properly configured parameter server?
I am in the fortunate position of having access to my university's SLURM-powered GPU cluster. I have been trying to get TensorFlow running on a cluster node, but so far I have failed to find any documentation. (Everyone I have spoken to at the university has previously run it on CPU nodes, or on a single GPU node.)
I found an excellent bit of documentation in a previous question here. Unfortunately, it's rather incomplete. The other distributed examples I have found, such as this one, rely on explicitly specifying the parameter server.
When I try to run it using the code from that question, it appears to work until it either fails to connect to a nonexistent parameter server or hangs when server.join is called, with no print-outs written to the sbatch outfile (which I understand should happen).
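For context, this is roughly the sbatch layout I am attempting. The script name, resource counts, and flag names below are placeholders of my own, not taken from the question I linked:

```shell
#!/bin/bash
# Hypothetical sbatch script -- job name, node counts, and train.py are placeholders.
#SBATCH --job-name=tf-distributed
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1

# Expand the compact SLURM hostlist (e.g. node[01-03]) into one hostname per line.
HOSTLIST=$(scontrol show hostnames "$SLURM_JOB_NODELIST")

# Launch one task per node; each task can inspect SLURM_PROCID and the
# hostlist to decide whether it should act as a ps or a worker.
srun python train.py --hosts "$HOSTLIST"
```

My uncertainty is entirely in what train.py should do with that hostlist on each node.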
So, in short, my question is: how does one go about starting TensorFlow on a SLURM cluster, from the sbatch stage onwards? This is my first time dealing with a distributed computing framework besides Spark on AWS, and I would love to learn more about how to properly configure TensorFlow. How do I specify that one of the items in tf_hostlist should serve as the parameter server? Alternatively, can I use sbatch to send slightly different commands to each worker, as I have seen done in other examples?
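To make the question concrete, here is a minimal sketch of the role-assignment logic I imagine each task needing: the first host in the expanded hostlist becomes the parameter server and the rest become workers. The function names (build_cluster, my_job) and the port are my own invention; the resulting dict is what I would expect to pass into tf.train.ClusterSpec, after which ps tasks would call server.join() and workers would run training:

```python
def build_cluster(hostnames, n_ps=1, port=2222):
    """Split an expanded SLURM hostlist (e.g. from `scontrol show hostnames`)
    into parameter-server and worker jobs: first n_ps hosts are ps, rest workers."""
    hosts = ["%s:%d" % (h, port) for h in hostnames]
    return {"ps": hosts[:n_ps], "worker": hosts[n_ps:]}

def my_job(cluster, my_host, port=2222):
    """Return (job_name, task_index) for the current node,
    suitable for tf.train.Server(cluster_spec, job_name=..., task_index=...)."""
    addr = "%s:%d" % (my_host, port)
    for job_name, addrs in cluster.items():
        if addr in addrs:
            return job_name, addrs.index(addr)
    raise ValueError("host %s not found in cluster spec" % my_host)

# Example with three nodes: node01 becomes the ps, node02/node03 the workers.
cluster = build_cluster(["node01", "node02", "node03"])
# cluster == {"ps": ["node01:2222"], "worker": ["node02:2222", "node03:2222"]}
job, idx = my_job(cluster, "node02")
# job == "worker", idx == 0
```

Is something like this the intended approach, or is there a more idiomatic way to derive the parameter server assignment from the SLURM environment?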