The way CTRL-C in Postgres CLI cancels queries is incredibly hack-y

(neon.com)

128 points | by andrenotgiant 3 days ago

8 comments

kelnos 4 hours ago
> There are architectural reasons why psql doesn’t yet use libpq’s encrypted cancellation functions (it “would need a much larger refactor to be able to call them due to the new functions not being signal-safe”)
This surprised me. I was like, "surely socket()/connect()/send()/recv() aren't async signal safe!" But after a quick trip to `man signal-safety`, it turns out they are. I guess it shouldn't have surprised me: likely all of those functions are little more than wrappers around the corresponding syscalls, so there isn't any libc state to possibly corrupt or deadlock you if you use them in a signal handler. And I assume the kernel needs to keep itself in a consistent, non-deadlockable state before it calls a signal handler anyway.
(And I'm not at all surprised that whatever TLS library they're using calls things or is itself not async signal safe.)
Either way, wow! In 2026 it feels absolutely bonkers that a software dev team would continue to put out something like this. Honestly, once psql got TLS support, when you make a TLS connection it should have put up a big warning and asked you, "This program cannot cancel queries over a secure channel; do you still want to enable query cancellation?" Or hell, just disable query cancellation in those cases and not even give an option.
I guess this is "just" a DoS, though, and only in cases where someone authorized is poking around using psql while connected to a server exposed to the public internet. Hopefully that situation isn't common. And even if it is, there's no opportunity for data exfiltration or RCE, so... the author's "heebie-jeebies level 6" feels appropriate.
(And there's an easy mitigation if you know the issue: once you cancel a query with ctrl+c, quit the psql session and start a new one. That will give the new process a new "cancellation key", and the old one from the old process won't work for an attacker anymore.)
rlpb 17 hours ago
TCP has an "urgent data" feature that might have been used for this kind of thing, used for Ctrl-C in telnet, etc. It can be used to bypass any pending send buffer and received by the server ahead of any unread data.
- mike_hearn 16 hours ago
  Fun fact: Oracle implements cancellation this way.
  The downside is that sometimes connections are proxied in ways that lose these unusual packets. Looking at you, Docker...
- ralferoo 11 hours ago
  Just googling it now and TCP urgent data seems to be a mess.
  Reading the original RFC 793 it's clear that the intention was never for this to be OOB data, but to inform the receiver that they should consume as much data as possible and minimally process it / buffer it locally until they have read up to the urgent data.
  However, the way it was historically implemented as OOB data seems to be significantly more useful - you could send flow control messaging to be processed immediately even if you knew the receiving side had a lot of data to consume before it'd see an inline message.
  It seems nowadays the advice is just to not use urgent data at all.
- ZiiS 14 hours ago
  Unfortunately there can be many buffers between you and the server which "urgent data" doesn't skip by design. (There were also lots of implementation problems.)
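For anyone who has never seen urgent data in action, here is a minimal Python sketch of the mechanism this thread describes: one byte sent with `MSG_OOB` over a loopback TCP connection is read by the receiver ahead of normal data that is still sitting unread. The function name `demo_urgent_byte` is made up for illustration, and the behaviour shown assumes a Linux-style sockets implementation with no proxy or middlebox in the path, which is exactly the caveat raised above.

```python
import select
import socket


def demo_urgent_byte():
    """Send normal data, then one urgent byte, over loopback TCP."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    server, _ = listener.accept()
    try:
        client.sendall(b"queued data")     # ordinary in-band bytes
        client.send(b"!", socket.MSG_OOB)  # a single byte flagged as urgent
        # Pending urgent data is reported as an "exceptional condition".
        _, _, exceptional = select.select([], [], [server], 5.0)
        if not exceptional:
            raise TimeoutError("no urgent data arrived")
        urgent = server.recv(1, socket.MSG_OOB)  # read before the backlog
        normal = server.recv(1024)               # the backlog is still unread
        return urgent, normal
    finally:
        for sock in (client, server, listener):
            sock.close()
```

Note that only a single byte can be "urgent" at a time, so in practice it acts as a signal to go look for a cancellation rather than as a channel for the cancellation message itself.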
ZiiS 14 hours ago
A good write up explaining how assumptions of network and security design have changed so much over the years. Also you have to give credit nowadays for not overly sensationalizing: 'heebie-jeebies level 6'. I certainly continued reusing a connection I assumed was TLS after a cancel, so I was vulnerable to a DoS; but equally, if the next statement was canceled I would switch to a new connection: no harm, no foul.
jtwaleson 18 hours ago
From the title I was hoping for this being hacky on the server application side, like how it aborts and clears the memory for a running query.
Still an interesting read. Just wondering, why can't the TCP connection of the query be used to send a cancellation request? Why does it have to be out of band?
- mike_hearn 16 hours ago
  Because Postgres is a very old codebase and was written in a style that assumes there are no threads, and thus there's nothing to listen for a cancellation packet whilst work is getting done. A lot of UNIXes had very poor support for threads for a long time and so this kind of multi-process architecture is common in old codebases.
  The TCP URG bit came out of this kind of problem. It triggers a SIGURG signal on UNIX which interrupts the process. Oracle works this way.
  These days you'd implement cancellation by having one thread handle inbound messages and another thread do the actual work with shared memory to implement a cooperative cancellation mechanic.
  But we should in general have sympathy here. Very little software and very few protocols properly implement any form of cancellation. HTTP hardly does for normal requests, and even if it did, how many web servers abort request processing if the connection drops?
  - Someone 13 hours ago
    > The TCP URG bit came out of this kind of problem. It triggers a SIGURG signal on UNIX which interrupts the process. Oracle works this way.
    https://datatracker.ietf.org/doc/html/rfc6093:
    “it is strongly recommended that applications do not employ urgent indications. Nevertheless, urgent indications are still retained as a mandatory part of the TCP protocol to support the few legacy applications that employ them. However, it is expected that even these applications will have difficulties in environments with middleboxes.”
  - marcosdumay 10 hours ago
    > how many web servers abort request processing if the connection drops?
    I don't think I have ever seen a published web service whose error log wasn't full of broken pipe messages. So, AFAIK, all.
    - toast0 9 hours ago
      You only get a broken pipe when you write, which is often after you've already done most of the work.
  - asveikau 10 hours ago
    > These days you'd implement cancellation by having one thread handle inbound messages and another thread do the actual work with shared memory to implement a cooperative cancellation mechanic.
    Doesn't necessarily need a thread per connection. Could be on an epoll/kqueue/io-uring.
    The query would need to periodically re-check a cancellation flag, which has costs and would come with a delay if it's particularly busy.
  - johannes1234321 14 hours ago
    It isn't really easy to do. A client may send tons of data over the connection, probably data which is calculated by the client as the client's buffer empties. If the server clears the buffers all the time to check for a cancellation it may have quite bad consequences.
- toast0 18 hours ago
  I don't know much about postgres, but as I understand it, it's a pretty standard server application. Read a request from the client, work on the request, send the result, read the next request.
  Changing that to poll for a cancellation while working is a big change. Also, the server would need to buffer any pipelined requests while looking for a cancellation request. A second connection is not without wrinkles, but it avoids a lot of network complexity.
- bob1029 18 hours ago
  MSSQL uses a special message over an existing connection:
  https://learn.microsoft.com/en-us/openspecs/windows_protocol...
- CamouflagedKiwi 16 hours ago
  It's basically got a process per connection; while it's working on a query, that process isn't listening to incoming traffic on the network socket any more.
- hlinnaka 15 hours ago
  Because then the cancellation request would get queued behind any other data that's in flight from the client to the server. In the worst case the TCP buffers are full, and the client cannot even send the request until the server processes some of the existing data that's in-flight.
  - adrian_b 13 hours ago
    As others have said, TCP allows sending urgent packets, precisely for solving this problem.
    At the receiver, a signal handler must be used, which will be invoked when an urgent packet is received, with SIGURG.
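The cooperative approach mentioned in this thread (something else notices the cancel and sets a flag, and the query code re-checks that flag periodically) can be sketched in a few lines of Python. This is only an illustration of the trade-off, not how Postgres does it: `run_query` and `check_every` are hypothetical names, and the "work" is a stand-in counter. A listener thread or an event loop would be the thing calling `cancel.set()`.

```python
import threading


def run_query(rows, cancel, check_every=100):
    """Process rows, polling a shared cancellation flag every N rows.

    Returns (rows_processed, was_cancelled).
    """
    processed = 0
    for _row in rows:
        # Polling has a cost, so only look at the flag occasionally.
        if processed % check_every == 0 and cancel.is_set():
            return processed, True
        processed += 1  # stand-in for the real per-row work
    return processed, False
```

The `check_every` knob is the cost/latency trade-off from the comment above: checking less often is cheaper, but a cancel that arrives just after a check is not noticed until the next one.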
michalc 18 hours ago
I think I can understand why this wasn’t addressed for so long: in the vast majority of cases if your db is exposed on a network level to untrusted sources, then you probably have far bigger problems?
- hrmtst93837 15 hours ago
  That's the kind of hand-wave that turns into a CVE later. Network exposure is one thing, but weird signal handling in local tooling can still become a cross-session bug or a nasty security footgun on shared infra, terminals, or jump boxes.
  If you have shared psql sessions in tmux or on a jump box one bad cancel can trash someone else's work. 'Just firewall it' is how you end up owned by the intern with shell access.
- pilif 16 hours ago
  it's also very tricky to do given the current architecture on the server side where one single-threaded process handles the connection and uses (for all intents and purposes) sync io.
  In such a scenario, listening (and acting) on cancellation requests on the same connection becomes very hard, so fixing this goes way beyond "just".
kardianos 11 hours ago
In general I love postgres. There are two problems with postgresql in my book: the protocol (proto3) and no great way to directly query using a different language.
The protocol has no direct in-protocol cancellation, like TDS has. TDS does this by making a framed protocol, at the application protocol level it can cancel queries. It has two variants (text and binary) and can cause fragmentation, and at the query and protocol level only supports positional parameters, no named parameters.
Once a query is on the server, it doesn't support directly acting on a language mode. I don't want to go into SQL mode and create a PL/SQL proc, I just want direct PL/SQL. Can't (really) do that well. Directly returning multiple result sets (eg for a matrix, separate rows, columns, and fields) or related queries in a single round trip is technically possible, but hard to do. So frustrating.
i18nagentai 13 hours ago
What strikes me most about this is how it illustrates the tension between backward compatibility and security in long-lived systems. The cancel key approach made total sense in the context of early Unix networking assumptions, but those assumptions have quietly eroded over decades. The fact that the cancel token is only 32 bits of entropy and sent in cleartext means it was never really designed for adversarial environments -- it was a convenience feature that became load-bearing infrastructure. I wonder if the Postgres community will eventually move toward a multiplexed protocol layer (similar to what HTTP/2 did for HTTP) rather than trying to bolt security onto the existing out-of-band mechanism.
- dmurray 13 hours ago
  Doesn't it also make sense in the context of modern networking assumptions?
  I've never had to connect to PostGres in an adversarial environment. I've been at work or at home and I connected to PostGres instances owned by me or my employer. If I tried to connect to my work instance from a coffee shop, the first thing I'd do would be to log in to a VPN. That's your multiplexed protocol layer right there: the security happens at the network layer and your cancel happens at the application layer.
  This is a different situation from websites. I connect to websites owned by third parties all the time, and I want my communication there to be encrypted at the application layer.
  - xmcqdpt2 12 hours ago
    Zero trust security, which is becoming increasingly common, is based on removing the internal / external network dichotomy entirely. Everything should be assumed to be reachable from the open internet (so SSO, OIDC everywhere.)
  - somat 10 hours ago
    It makes me think of ipsec. ipsec was originally intended to be used sort of the same way we use tls today, but application independent: when making a connection to a random remote machine the os would see if it could spool up an ipsec sa. No changes to a user program would be needed. But while they were faffing about trying to overcomplicate ipsec, ssl came along and stole its lunch money.
    This application of ipsec was never used and barely implemented. Today getting it to make ad-hoc connections is a tricky untested edge case and ipsec was relegated to dedicated tunnels, where everyone hates it because it is too tricky to get the parameters aligned.
    There is definitely a case to be made that it is right and proper that secure connections are handled in the application (tls), but sometimes I like to think of how it could have been, where all applications get a secure connection whether they want one or not.
    As a useless dangling side thought, an additional piece would be needed for ad-hoc ipsec that as far as I know was never implemented: a way to notify the OS that this connection must be encrypted (a socket option? SO_ENC?). This is most of the case for encrypted connections being the duty of the application.
  - gruez 12 hours ago
    >I've never had to connect to PostGres in an adversarial environment.
    heroku's postgres database service still exposes itself on the public internet.
- tensegrist 12 hours ago
  is sed s/—/--/ the new meta
  - kelnos 4 hours ago
    I have used "--" as a lazy-man's emdash for decades at this point. Once I heard that people started assuming text that uses emdashes was written by an LLM I got worried that people were going to think that I'm an LLM, but then I realized the LLMs use the real unicode emdash character, while I just use two regular ASCII-zone hyphens. Whew.
    (Also I just learned that ASCII 0x2d/unicode U+002D is more properly called a "hyphen" [well, "HYPHEN-MINUS"], not a "dash".)
  - shilgapira 12 hours ago
    offtopic, but it's interesting how large of a discrepancy there is between the length of your comment and how much time i'd have to spend explaining background info to a non-programmer to get them to understand why this is funny
    - pas 9 hours ago
      why is it funny? it seems like a sincere question
      - petit_robert 3 hours ago
        The way it's written is funny, I find. But I'm a programmer...
        And as GP wrote, it would take a substantial amount of time to explain to a normie (infinite I'd say, but let's not despair).
  - arvyy 12 hours ago
    hardly new, I've used it before advent of llm popularity, and I wasn't alone
  - mattkrause 11 hours ago
    It should be THREE hyphens for an em-dash!
    - kelnos 4 hours ago
      In theory, yes, endash would be "--" and emdash would be "---", but oof, the three hyphens looks like way too much in normal text. So I've always used "--".
  - paulddraper 11 hours ago
    I've always used it.
    My keyboard has -.
- megous 9 hours ago
  cancel key is arbitrary sized
  https://www.postgresql.org/docs/current/protocol-message-for... / BackendKeyData
  I'm fairly certain that this cancellation approach has nothing to do with UNIX networking assumptions, and everything to do with the connection/process model of PostgreSQL.
  Creating a connection => starting a process and passing the accepted socket to it (so in-band cancel would have to go directly to the backend executing the query) + single-threaded backend process not reading from socket when executing a query, so it would get the cancellation request only after the query finishes (or even after all pipelined queries before it finish, which is even worse).
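For reference, the CancelRequest message argued about in this thread is small enough to build by hand. A minimal Python sketch, assuming the layout given in the PostgreSQL protocol documentation: a 4-byte length, the fixed request code 80877102, the backend PID, then the secret key from BackendKeyData (always 4 bytes before protocol version 3.2, variable-length from 3.2 on). `build_cancel_request` is a hypothetical helper name, not a libpq function, and the sketch only builds the bytes; it does not open the second connection.

```python
import struct

# Fixed request code: 1234 in the high 16 bits, 5678 in the low 16 bits.
CANCEL_REQUEST_CODE = (1234 << 16) | 5678  # == 80877102


def build_cancel_request(backend_pid: int, secret_key: bytes) -> bytes:
    """Return the raw bytes of a CancelRequest message.

    The length field counts itself, the code and the PID (12 bytes)
    plus however long the secret key is.
    """
    length = 12 + len(secret_key)
    header = struct.pack("!iii", length, CANCEL_REQUEST_CODE, backend_pid)
    return header + secret_key
```

With a 4-byte key this yields the classic 16-byte packet, which is the "32 bits of entropy" case; a longer key simply makes the message longer.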
gpderetta 13 hours ago
TLS is not async signal safe. But having a dedicated thread whose responsibility is to only send cancel tokens via a TLS connection and is woken up by a posix semaphore seems a small, self contained change that doesn't require any major refactoring.
- eqvinox 10 hours ago
  This doesn't really have anything to do with async signal safety. You could perfectly well capture the Ctrl+C in psql, stick a notifier byte in a pipe (or use signalfd to begin with) and handle it as a synchronous event in a main loop. You'd still need to establish a new connection purely to bypass buffered data. (or use TCP URG, but that seems generally a poor idea.)
  - kelnos 4 hours ago
    Well, it does, because -- as the article notes -- psql creates and sends on the new connection inside the signal handler, and that doing the pipe-write thing instead (required since their TLS library is presumably not async signal safe) would require a major refactor of the code.
    Likely psql doesn't even have a "main loop"; I expect it just blocks on recv() until it gets a response from the server. And on Linux, I think it will automatically restart/resume syscalls that were in progress when a signal fires, so you can't even rely on EINTR to get you out of that recv() so you could check a global flag that you could set in the signal handler.
    Although, reading the sigaction() manpage, if you don't specify SA_RESTART, it shouldn't do this? (If they are using signal() and not sigaction(), it might always restart?) But still, not sure why they don't take that route. I imagine it would require much less of a refactor to set a global flag, and then always check it after a recv() fails with EINTR.
    Sure, the "right" thing to do is have a global pipe, and instead of blocking in recv(), poll() on it with both the connection socket and the read end of the pipe. And I bet that would require a bit of a refactor. But a global flag is somewhere in the middle...
    But who knows; I've never read their source code, so I expect they know what they're talking about when they say it's not a trivial fix.
  - gpderetta 5 hours ago
    > This doesn't really have anything to do with async signal safety.
    TLS not being async signal safe is explicitly called out in the article as the reason the token is sent in clear text.
    > Handle it as a synchronous event in a main loop
    Of course if you rearchitect the client there are better solutions. But again, the article mentions that's not planned for now.
    By comparison, delegating cancellation to a background thread can be done non-intrusively. In principle no code outside the cancel path needs changing.
    Edit: the article mentions that there is a refactor in the works to implement cancel over tls [1]. Turns out that they decided to use a thread (with a pipe for signaling).
    [1] https://www.postgresql.org/message-id/flat/DEY0N7FS8NCU.1F7Q...
    - kelnos 3 hours ago
      > By comparison, delegating cancellation to a background thread can be done non-intrusively. In principle no code outside the cancel path needs changing.
      pthread_create() isn't async signal safe, though, so they can't simply move their socket code for the cancellation into another function and call pthread_create() on it. They still have to get the main thread to stop doing what it's doing (usually via the pipe trick) in order to create the thread, which could easily be a big refactor.
      > Edit: the article mentions that there is a refactor in the works to implement cancel over tls [1]. Turns out that they decided to use a thread (with a pipe for signaling).
      Seems odd to me to bother. If you have to do the pipe thing, why not just do the new connection for cancellation in the main thread once it sees the data on the pipe? I guess that way they can return control of the CLI to the user while they cancel in the background, rather than blocking the user while the cancellation is going on. But as a user, I kinda would like to know that the query I just cancelled actually got cancelled, a property that the old code has, but the new code won't.
      (Presumably the new code can print a warning if cancellation fails, but it could take a long time to fail, and in the meantime the user has moved on.)
      - gpderetta 3 hours ago
        Of course you don't spawn a thread from the signal handler. You start it first thing in main and park it waiting for a wakeup.
  - pas 8 hours ago
    I believe the suggestion is to have a TLS endpoint in the server, which demultiplexes the incoming CancelRequest and signals to the corresponding worker process via shared memory
    - kelnos 3 hours ago
      The problem isn't on the server; the server already knows how to cancel things, and already supports cancellation over TLS. It's just that psql doesn't use it, due to the need for a refactor to make that work. Other psql-like frontends do already use it, as the article points out.
      - pas 3 hours ago
        ah, true, thanks! (unfortunately I can't delete/edit the comment.)
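The "start the thread first thing in main, park it, and have the signal handler do nothing but wake it" design debated in this thread looks roughly like the Python sketch below. It is an illustration of the structure only, under loud assumptions: `BackgroundCanceller`, `install_canceller` and the `send_cancel` callback are hypothetical names, the callback stands in for "open a TLS connection and send the cancel", and Python signal handlers are not a real async-signal context (they run in the main thread between bytecodes), so this shows the shape of the idea rather than the C-level safety argument.

```python
import os
import signal
import threading


class BackgroundCanceller:
    """Park a thread on a pipe; the signal handler only writes one byte."""

    def __init__(self, send_cancel):
        self._send_cancel = send_cancel  # hypothetical: would do the TLS cancel
        self._rfd, self._wfd = os.pipe()
        self.done = threading.Event()
        # Started up front, never from inside the signal handler.
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        # Blocks in read() until the handler writes a wakeup byte.
        while os.read(self._rfd, 1):
            self._send_cancel()
            self.done.set()

    def handle_sigint(self, signum, frame):
        # The only work done in the handler: a single write() to the pipe.
        os.write(self._wfd, b"c")


def install_canceller(send_cancel):
    """Create the parked thread, then route SIGINT to it."""
    canceller = BackgroundCanceller(send_cancel)
    signal.signal(signal.SIGINT, canceller.handle_sigint)
    return canceller
```

The `done` event is one way to address the concern about the user not knowing whether the cancel went through: the main thread can choose to wait on it (with a timeout) instead of returning control immediately.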