distributed computing - How to run TensorFlow on a SLURM cluster with a properly configured parameter server?


I am in the fortunate position of having access to my university's SLURM-powered GPU cluster. I have been trying to get TensorFlow running on a cluster node, but so far I have failed to find any documentation. (Everyone I have spoken to at the university has only run it on CPU nodes before, or on a single GPU node.)

I found an excellent bit of documentation in a previous question here. Unfortunately, it's rather incomplete. All of the other distributed examples I have found, such as this one, rely on explicitly specifying the parameter server.

When I try to run it using the code from that question, it appears to work until it either fails to connect to a nonexistent parameter server, or hangs when server.join is called with no printouts written to the sbatch outfile (which I understand is expected to happen).

So, in short, my question is: how does one go about starting TensorFlow on a SLURM cluster, from the sbatch stage onwards? This is my first time dealing with a distributed computing framework besides Spark on AWS, and I would love to learn more about how to properly configure TensorFlow. How do I specify, for example, that one of the items in tf_hostlist should serve as the parameter server? Alternatively, can I use sbatch to send slightly different commands to each worker, as I have seen done in other examples?
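For reference, here is a rough, untested sketch of what I imagine the first approach might look like. It assumes the python-hostlist package is available, that SLURM exposes SLURM_JOB_NODELIST and SLURM_PROCID to each task, and an arbitrarily chosen free port:

    # Minimal sketch (TF 1.x style) of deriving a cluster spec from SLURM.
    # Assumptions: python-hostlist is installed, SLURM_JOB_NODELIST and
    # SLURM_PROCID are set for every task, and port 22222 is free.
    import os
    import hostlist
    import tensorflow as tf

    # Expand e.g. "node[01-03]" into ["node01", "node02", "node03"]
    nodes = hostlist.expand_hostlist(os.environ["SLURM_JOB_NODELIST"])
    port = 22222

    # Use the first allocated host as the parameter server, the rest as workers
    ps_hosts = ["%s:%d" % (nodes[0], port)]
    worker_hosts = ["%s:%d" % (n, port) for n in nodes[1:]]
    cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})

    # Each task decides its own role from its SLURM rank
    rank = int(os.environ["SLURM_PROCID"])
    if rank == 0:
        server = tf.train.Server(cluster, job_name="ps", task_index=0)
        server.join()  # parameter server blocks here, serving variables
    else:
        server = tf.train.Server(cluster, job_name="worker", task_index=rank - 1)
        # build the graph with tf.train.replica_device_setter(cluster=cluster)
        # and run training sessions against server.target

The idea would be to submit a single script with sbatch and have each task pick its role from its SLURM rank, rather than writing a separate command for every worker; whether that is the intended way to do it is exactly what I am asking.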

