Workers should validate their tokens at startup
Description
Workers require an access token for downloading job files from the core. While job descriptions arrive over the messaging socket, the actual file needs to be pulled from an authenticated API endpoint, as it can be quite large.
Recent (> 7 mo old) tokens are valid for a very long time, but misconfiguration (passing an invalidated token, copy+paste/typing errors, etc.) can still cause a worker to be unable to download job files. This causes a rather vague "Incomplete or error response 403" message to be returned.
To start, this message could be improved. If the JobDownloader receives a 403 Forbidden, it can emit an error saying that the token is invalid. It should do the same for an HTTP 401, as that would be the correct response in the case of an expired token.
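As a rough illustration of the improved message, the status code could be mapped to an operator-facing error along these lines. This is only a sketch: `describe_download_failure` is a hypothetical helper, and the real JobDownloader interface is not shown in this issue.

```python
from http import HTTPStatus


def describe_download_failure(status_code: int) -> str:
    """Turn the HTTP status of a failed job download into a readable message.

    Hypothetical helper for illustration; not part of the existing JobDownloader.
    """
    code = int(status_code)
    if code in (HTTPStatus.UNAUTHORIZED, HTTPStatus.FORBIDDEN):
        # 401 and 403 are treated alike: both point at a token problem.
        return (
            f"Job download rejected with HTTP {code}: "
            "the worker's access token is invalid or expired. "
            "Check the token in the worker configuration."
        )
    # Any other status keeps the existing generic wording.
    return f"Incomplete or error response {code}"
```

Treating 401 and 403 the same way on the worker side keeps the message useful regardless of which of the two the core returns.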
Workers should also actively check whether the token they were given is usable. This should happen at startup, when the administrator is still in the mindset of configuring workers. There are two logical routes for this:
- The worker requests information about the current user via /api/v1/user and checks that it yields a valid user with the WORKER role;
- The worker attaches the token in the JUMP polo message and the core verifies whether the token is valid.
The first method is easiest and requires no additional work in the core. This endpoint is already used by the UI to check the current logged-in status, which is precisely what the worker needs to do as well.
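A minimal sketch of the first method, using only the Python standard library. Several details are assumptions rather than facts about the core: the bearer-style Authorization header, a JSON user object with a "roles" list, and the names `check_token` and `TokenCheckError`, which are made up for this example.

```python
import json
import urllib.error
import urllib.request


class TokenCheckError(Exception):
    """Raised when the startup token check fails (hypothetical name)."""


def has_worker_role(user: dict) -> bool:
    # ASSUMPTION: the user object lists its roles under a "roles" key.
    return "WORKER" in user.get("roles", [])


def check_token(base_url: str, token: str, opener=urllib.request.urlopen) -> None:
    """Raise TokenCheckError unless the token maps to a user with the WORKER role."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/api/v1/user",
        # ASSUMPTION: bearer-style header; the core may expect another scheme.
        headers={"Authorization": f"Bearer {token}"},
    )
    try:
        with opener(request) as response:
            user = json.load(response)
    except urllib.error.HTTPError as err:
        if err.code in (401, 403):
            raise TokenCheckError(
                f"token rejected by the core (HTTP {err.code})"
            ) from err
        raise  # other HTTP errors are not token problems
    if not isinstance(user, dict) or not has_worker_role(user):
        raise TokenCheckError("token is accepted but its user lacks the WORKER role")
```

A worker would call `check_token(core_url, token)` once during startup, log the error, and exit if it raises. The `opener` parameter exists only so the check can be exercised without a running core.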
The second method is more involved and requires the core to perform some checks, at least when a new worker is initiated. It is also more flexible: the core then knows which worker uses which token, which opens up possibilities for improved worker management. It does require two-way communication, because when the core finds that the token is invalid, it needs to inform the worker and boot it from the worker pool. In turn, the worker must log an error message stating that its token is invalid, and terminate.
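The exchange described for the second method might look roughly like the sketch below. It is a toy model, not the JUMP protocol: the message fields ("token", "type"), the `WorkerPool` class, and the "accepted"/"rejected" replies are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class WorkerPool:
    """Toy stand-in for the core's worker registry (hypothetical)."""

    valid_tokens: set
    workers: dict = field(default_factory=dict)  # worker id -> token

    def handle_polo(self, worker_id: str, message: dict) -> dict:
        """Register the worker if its token checks out, otherwise reject it."""
        token = message.get("token")
        if not isinstance(token, str) or token not in self.valid_tokens:
            # Boot the worker from the pool and tell it why.
            self.workers.pop(worker_id, None)
            return {"type": "rejected", "reason": "invalid token"}
        # The core now knows which worker uses which token.
        self.workers[worker_id] = token
        return {"type": "accepted"}


def worker_should_terminate(reply: dict) -> bool:
    """Worker side: a rejection means log the error and shut down."""
    return reply.get("type") == "rejected"
```

The reply message is what makes the communication two-way: without it, the worker would stay connected and only discover the bad token on its first download.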
Priority
Medium -- the error should not happen too often and if it does it's easy enough to diagnose. It remains a major annoyance if it does happen, however.
Definition of done
- improve the error message when an attempt at downloading a job yields a 403 or 401
- (optionally) clear up 403/401 confusion in the core
- either:
  - have the worker poll /api/v1/user to check whether the token is a valid worker token
  - have the worker attach its token to polo messages and let the core verify it, complete with proper handling on the worker side