Jobs

Learn the concept behind AI Training jobs

Last updated 18th May, 2021.

Definition

A job in AI Training is the workload unit submitted to the cluster. A job runs as a Docker container within OVHcloud infrastructure.

Each job is linked to a Public Cloud project and specifies the amount of resources to use for the training task, along with a Docker image. The image can be publicly available, stored in the AI Training shared registry scoped to your project, or stored in a private registry that you have added. For the latter, see the OVHcloud documentation on how to add a private registry.

Considerations

  • A job runs indefinitely until it completes or is manually interrupted.
  • Data can be attached to a job to serve as input for your training workload, as output (e.g. model weights), or both.
  • If you do not customise your resource request, the default request is 1 GPU. Memory is not customisable.
  • Billing for jobs is minute-based and runs from job initialisation until completion. Each commenced minute is billed in full.
  • You can read further on job limitations here.
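The minute-based billing rule above can be illustrated with a short calculation. This is a minimal sketch only: the function names and the `price_per_minute` value are hypothetical and do not come from the OVHcloud API or price list. It only shows how "each commenced minute is billed in full" translates into rounding up.

```python
import math


def billed_minutes(duration_seconds: float) -> int:
    """Return the number of billed minutes, rounding any commenced minute up."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    return math.ceil(duration_seconds / 60)


def job_cost(duration_seconds: float, price_per_minute: float) -> float:
    """Estimate the cost of a job.

    price_per_minute is a hypothetical rate for the whole resource request;
    check the actual pricing for your project.
    """
    return billed_minutes(duration_seconds) * price_per_minute


if __name__ == "__main__":
    # A job billed for 61 seconds is charged 2 full minutes.
    print(billed_minutes(61))
    print(job_cost(61, price_per_minute=0.5))
```

With this rule, a job billed for 60 seconds counts as one minute, while one billed for 61 seconds counts as two.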

Under the hood

Jobs in AI Training are Docker containers within OVHcloud infrastructure.

Job lifecycle

During its lifetime the job will transition between the following statuses:

  • QUEUED: the job run request is about to be processed.
  • INITIALIZING: the job instance is created and the data is synchronised from the Object Storage. To learn more about data synchronisation, see the "Data - How it works" section.
  • PENDING: the job is being started.
  • RUNNING: the job is running.
  • INTERRUPTING: the job is still running, but an interruption order was received and is about to be processed.
  • FINALIZING: the job instance is deleted and the data is synchronised back to the Object Storage. To learn more about data synchronisation, see the "Data - How it works" section.
  • DONE: the job ended normally.
  • TIMEOUT: the job is still running but is about to be interrupted because the timeout was reached.
  • INTERRUPTED: the job has ended and was interrupted.
  • FAILED: the job ended with an error, e.g. the process in the job finished with a non-zero exit code or the Docker image could not be pulled.
  • ERROR: the job ended due to a backend error.
  • Only jobs that reach the RUNNING status are billed. For those jobs, billing starts with the INITIALIZING step and ends when the FINALIZING step starts.
  • Only jobs in the QUEUED, INITIALIZING, PENDING and RUNNING statuses are included in the quota computation.
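The statuses and the quota rule described in this section can be modelled in a few lines of code. This is an illustrative sketch, not an official client: the `JobStatus` enum, the helper names and the grouping of DONE, INTERRUPTED, FAILED and ERROR as "ended" statuses are assumptions derived from the descriptions above.

```python
from enum import Enum


class JobStatus(Enum):
    """Job statuses as listed in this guide."""
    QUEUED = "QUEUED"
    INITIALIZING = "INITIALIZING"
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    INTERRUPTING = "INTERRUPTING"
    FINALIZING = "FINALIZING"
    DONE = "DONE"
    TIMEOUT = "TIMEOUT"
    INTERRUPTED = "INTERRUPTED"
    FAILED = "FAILED"
    ERROR = "ERROR"


# Statuses included in the quota computation, per this guide.
QUOTA_STATUSES = frozenset({
    JobStatus.QUEUED,
    JobStatus.INITIALIZING,
    JobStatus.PENDING,
    JobStatus.RUNNING,
})

# Assumption: statuses described as "ended" are treated as final.
ENDED_STATUSES = frozenset({
    JobStatus.DONE,
    JobStatus.INTERRUPTED,
    JobStatus.FAILED,
    JobStatus.ERROR,
})


def counts_toward_quota(status: JobStatus) -> bool:
    """True if a job in this status is included in the quota computation."""
    return status in QUOTA_STATUSES


def has_ended(status: JobStatus) -> bool:
    """True if the job is no longer running (assumed final statuses)."""
    return status in ENDED_STATUSES


def quota_usage(status_names: list[str]) -> int:
    """Count how many jobs, given by status name, count toward the quota."""
    return sum(counts_toward_quota(JobStatus(name)) for name in status_names)


if __name__ == "__main__":
    print(quota_usage(["QUEUED", "RUNNING", "DONE", "FAILED"]))
```

A helper such as `has_ended` could be used, for example, to decide when to stop polling a job, since TIMEOUT and INTERRUPTING describe jobs that are still running.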


Going further

Feedback

Please send us your questions, feedback and suggestions to improve the service:

