I did this a few years ago with a stack of Pi 4s connected to a four-port PoE switch. One was an OpenWrt router, one was a Plex server connected to some spinning disks via USB, and I had another you could plug an HDMI cable into and use to view the media. I eventually found out I could host the whole thing on a single Pi, but it was still a fun project. Could probably do it all on a Pi 5 with an NVMe HAT, no problem. Might look into that when I get the spare tinkering money.
I thought Whisper was hallucinating huge chunks of text in that medical transcription app. Is it more reliable with smaller chunks?