[us-commits] [ehb54/us3lims_common] a6f9e2: Validate sbatch submission status before recording...
emre brookes
noreply at github.com
Sat Jun 20 15:07:49 MDT 2026
Branch: refs/heads/ehb54-issue-915
Home: https://github.com/ehb54/us3lims_common
Commit: a6f9e2ef5a75d86857bb1a34acafa0e6573a7b62
https://github.com/ehb54/us3lims_common/commit/a6f9e2ef5a75d86857bb1a34acafa0e6573a7b62
Author: ehb54 <brookes at uthscsa.edu>
Date: 2026-06-20 (Sat, 20 Jun 2026)
Changed paths:
M class/submit_local.php
M global_config.php.template
Log Message:
-----------
Validate sbatch submission status before recording a gfacID

submit_job() never checked the exit status of the sbatch/qsub exec()
call, and it parsed the job ID positionally from the first output line,
assuming that line was always "Submitted batch job <N>". Under scheduler
load, sbatch can instead fail with, e.g., "sbatch: error: Batch job
submission failed: Socket timed out on send/recv operation". The
positional parse of that line happens to land on the literal word
"job", which was then stored as the gfacID and tracked through the rest
of the pipeline as if a real job existed. The failure surfaced only much
later, and opaquely, as "Failed data fetch" during cleanup.

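The failure mode described above can be reproduced with a small sketch. The original parsing code is not shown in this message, so positional_job_id() below is a hypothetical stand-in that assumes the old code split the first output line on whitespace and took the fourth token. That assumption is consistent with the symptom reported here, but it may not match the real implementation.

```php
<?php
// Hypothetical stand-in for the old positional parse. The real code in
// class/submit_local.php is not shown in this message; taking the fourth
// whitespace-separated token is an assumption made for illustration.
function positional_job_id(string $first_line): string
{
    $tokens = preg_split('/\s+/', trim($first_line));
    return $tokens[3] ?? '';
}

$ok  = 'Submitted batch job 12345';
$bad = 'sbatch: error: Batch job submission failed: '
     . 'Socket timed out on send/recv operation';

echo positional_job_id($ok), "\n";   // prints 12345
echo positional_job_id($bad), "\n";  // prints job
```

Under that assumption, the success line and the error line both have a fourth token, so the parse never fails outright; it just returns the wrong thing.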
Validate the sbatch output against /^Submitted batch job\s+(\d+)/
and check exec()'s exit status before accepting a job ID. Because this
class of failure is transient and load-related, retry the submission
with exponential backoff on failure. Retries are configurable via
$global_sbatch_submit_retries and $global_sbatch_submit_retry_wait_seconds
in global_config.php and can be overridden per cluster. If all retries
are exhausted, update_db() now marks the request SUBMIT_TIMEOUT with the
real error instead of recording an empty or bogus gfacID and launching a
jobmonitor to watch a job that was never submitted.

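The validate-and-retry logic described above might look roughly like the following. This is a minimal sketch and not the code from this commit: submit_with_retry(), its parameters, and the injected $run and $sleep callables are hypothetical names, and the backoff schedule (base wait doubled after each failed attempt) is an assumption about how the two config values are used. Only the regular expression is taken from the log message.

```php
<?php
// Sketch only. Names are hypothetical; the doubling schedule is an assumption.
// $run   returns [exit_status, output_lines], e.g. from a wrapper around exec().
// $sleep receives the number of seconds to wait (injected so demos need not sleep).
function submit_with_retry(callable $run, int $retries, int $wait_seconds, callable $sleep): array
{
    $error = null;
    for ($attempt = 0; $attempt <= $retries; $attempt++) {
        if ($attempt > 0) {
            // Wait 1x, 2x, 4x, ... the base interval before each retry.
            $sleep($wait_seconds * (2 ** ($attempt - 1)));
        }
        [$status, $lines] = $run();
        $first = $lines[0] ?? '';
        // Accept a job ID only if the exit status is 0 AND the line matches.
        if ($status === 0 && preg_match('/^Submitted batch job\s+(\d+)/', $first, $m) === 1) {
            return ['job_id' => $m[1], 'error' => null, 'attempts' => $attempt + 1];
        }
        $error = $first !== '' ? $first : "sbatch exited with status $status";
    }
    return ['job_id' => null, 'error' => $error, 'attempts' => $retries + 1];
}

// Simulated scheduler: fails twice with the timeout error, then succeeds.
$timeout = 'sbatch: error: Batch job submission failed: '
         . 'Socket timed out on send/recv operation';
$replies = [[1, [$timeout]], [1, [$timeout]], [0, ['Submitted batch job 4242']]];
$waits   = [];

$result = submit_with_retry(
    function () use (&$replies) { return array_shift($replies); },
    3,  // stands in for $global_sbatch_submit_retries
    5,  // stands in for $global_sbatch_submit_retry_wait_seconds
    function (int $seconds) use (&$waits) { $waits[] = $seconds; }
);
// $result['job_id'] is "4242" after 3 attempts; $waits is [5, 10].
```

In real use, $run would wrap exec(), which fills an output array and an exit-status variable by reference, and $sleep would be sleep(). When job_id comes back null, the caller has the last error text available to record alongside SUBMIT_TIMEOUT rather than storing a bogus gfacID.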
Fixes ehb54/ultrascan-tickets#915