AI Training - Job concept

Learn the concept behind AI Training jobs

Last updated 18th May, 2021.

Definition

A job in AI Training is the workload unit submitted to the cluster. A job runs as a Docker container within OVHcloud infrastructure.

Each job is linked to a Public Cloud project. It specifies the amount of resources used to run the training task and a Docker image, which can be publicly available, stored in the AI Training shared registry scoped to your project, or hosted in a private registry that you have added. For the latter, see the OVHcloud documentation on how to add a private registry.
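As an illustration of the elements a job brings together, here is a minimal Python sketch. The `JobRequest` class and its field names are hypothetical and are not part of any OVHcloud API or CLI. The `registry_host` method applies a simplified version of the usual Docker convention for image references (the first path component is a registry host when it contains a dot or a colon, or is `localhost`), which is how a reference distinguishes a public image from one hosted in a specific registry.

```python
from dataclasses import dataclass

DEFAULT_REGISTRY = "docker.io"  # Docker's default public registry


@dataclass(frozen=True)
class JobRequest:
    """Hypothetical description of a job submission (not an OVHcloud API type)."""

    project_id: str  # Public Cloud project the job is linked to
    image: str       # Docker image reference, e.g. "registry.example.com/team/train:1.0"
    gpu: int = 1     # default resource request when nothing is customised

    def registry_host(self) -> str:
        """Return the registry host that the image reference points to.

        Simplified Docker rule: the first path component names a registry
        if it contains '.' or ':' or equals 'localhost'. Otherwise the image
        is assumed to come from the default public registry.
        """
        first, sep, _ = self.image.partition("/")
        if sep and ("." in first or ":" in first or first == "localhost"):
            return first
        return DEFAULT_REGISTRY


# Example usage with placeholder values
public_job = JobRequest(project_id="my-project", image="ubuntu:22.04")
private_job = JobRequest(
    project_id="my-project",
    image="registry.example.com/team/train:1.0",
    gpu=2,
)
```

Here `registry.example.com` is a placeholder host. The actual address of the shared registry or of a private registry depends on your project and is not assumed by this sketch.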

Considerations

  • A job runs until it completes or is manually interrupted.
  • Data can be attached to a job to serve as input for your training workload, as output (e.g. model weights), or both.
  • If you do not customise your resource request, the default request is 1 GPU. Memory is not customisable.
  • Billing for jobs is minute-based: it starts at job initialisation and runs until completion. Each commenced minute is billed in full.
  • For further details, see the documentation on job limitations.
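The minute-based billing rule can be sketched as a ceiling division of the job duration by one minute. This is an illustrative Python sketch only. The function names are hypothetical, and the per-minute rate is a placeholder parameter, since this page does not state actual prices.

```python
from datetime import datetime, timedelta
from decimal import Decimal

ONE_MINUTE = timedelta(minutes=1)


def billed_minutes(start: datetime, end: datetime) -> int:
    """Number of billed minutes when each commenced minute is billed in full."""
    duration = end - start
    if duration < timedelta(0):
        raise ValueError("end must not be earlier than start")
    # divmod on timedeltas returns (whole minutes, leftover timedelta)
    whole, remainder = divmod(duration, ONE_MINUTE)
    # Any leftover time means one more minute has been commenced
    return whole + (1 if remainder else 0)


def job_cost(minutes: int, rate_per_minute: Decimal) -> Decimal:
    """Cost for a number of billed minutes at a hypothetical per-minute rate."""
    return minutes * rate_per_minute
```

For example, a job whose billed period spans 61 seconds counts as 2 billed minutes, while a period of exactly 2 minutes also counts as 2.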

Under the hood

Jobs in AI Training are Docker containers within OVHcloud infrastructure.

Job lifecycle

During its lifetime, a job transitions between the following statuses:

  • QUEUED: the job run request is about to be processed.
  • INITIALIZING: the job instance is created and the data is synchronised from the Object Storage. To learn more about data synchronisation, see the Data How it works section.
  • PENDING: the job is being started.
  • RUNNING: the job is running.
  • INTERRUPTING: the job is still running, but an interruption order was received and is about to be processed.
  • FINALIZING: the job instance is deleted and the data is synchronised back to the Object Storage. To learn more about data synchronisation, see the Data How it works section.
  • DONE: the job ended normally.
  • TIMEOUT: the job is still running but is about to be interrupted because the timeout was reached.
  • INTERRUPTED: the job has ended after being interrupted.
  • FAILED: the job ended with an error, e.g. the process in the job exited with a non-zero exit code or the Docker image could not be pulled.
  • ERROR: the job ended due to a backend error.

Billing and quotas depend on these statuses:

  • Only jobs that reach the RUNNING status are billed. For those jobs, billing starts with the INITIALIZING step and ends when the FINALIZING step starts.
  • Only jobs in the QUEUED, INITIALIZING, PENDING and RUNNING statuses are included in the quota computation.
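The statuses and the billing and quota rules above can be modelled with a small Python sketch. The enum, the helper names and the shape of the status history (a list of status and timestamp pairs) are assumptions made for illustration and do not reflect an OVHcloud API. Treating DONE, INTERRUPTED, FAILED and ERROR as the final statuses is an interpretation of the descriptions above, which describe these as statuses in which the job has ended.

```python
from datetime import datetime
from enum import Enum
from typing import Optional


class JobStatus(Enum):
    QUEUED = "QUEUED"
    INITIALIZING = "INITIALIZING"
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    INTERRUPTING = "INTERRUPTING"
    FINALIZING = "FINALIZING"
    DONE = "DONE"
    TIMEOUT = "TIMEOUT"
    INTERRUPTED = "INTERRUPTED"
    FAILED = "FAILED"
    ERROR = "ERROR"


# Statuses included in the quota computation
QUOTA_STATUSES = frozenset({
    JobStatus.QUEUED, JobStatus.INITIALIZING,
    JobStatus.PENDING, JobStatus.RUNNING,
})

# Statuses described as "ended" (interpretation, see lead-in)
FINAL_STATUSES = frozenset({
    JobStatus.DONE, JobStatus.INTERRUPTED,
    JobStatus.FAILED, JobStatus.ERROR,
})


def counts_toward_quota(status: JobStatus) -> bool:
    return status in QUOTA_STATUSES


def is_finished(status: JobStatus) -> bool:
    return status in FINAL_STATUSES


def billable_window(
    history: list[tuple[JobStatus, datetime]],
) -> Optional[tuple[Optional[datetime], Optional[datetime]]]:
    """Return (billing start, billing end) for a hypothetical status history.

    Returns None if the job never reached RUNNING (it is not billed).
    The end is None while the FINALIZING step has not started yet.
    """
    first_seen: dict[JobStatus, datetime] = {}
    for status, timestamp in history:
        first_seen.setdefault(status, timestamp)  # keep the first occurrence
    if JobStatus.RUNNING not in first_seen:
        return None
    return (
        first_seen.get(JobStatus.INITIALIZING),
        first_seen.get(JobStatus.FINALIZING),
    )
```

A job that fails while initialising, for instance because its image could not be pulled, never reaches RUNNING, so `billable_window` returns `None` for it.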


Going further

Feedback

Please send us your questions, feedback and suggestions to improve the service.

