Two Tiny ZIO Stream Trappings: How Even Good Frameworks Still Require The Use Of Your Brain

2021/10/21

ZIO is a pretty good (not to say great!) framework. I can’t compete with Natan Silnitsky’s pitfalls to avoid, which are a good little tour for any newcommer to this subject.

Having relied on it in the last year in a production setting, I can however add a few cents on the subject of zio-streams, a powerfull addendum to the ZIO library that, you guessed it, let’s you do stuff in a streaming fashion.

Check out the doc if this is totally foreign to you.

This post seeks to help with diagnosing the two following situations:

Brains, Performance & Deadlocks

ZIO is surounded by quite some hype (some of which it deserves, if I may add). It goes along these lines:

The framework is extremely performant: it will let you do many things concurrently and safely!

Reading this, and being a bit caught up in the (sometimes steep) learning curve that comes with effect systems, you may interpret the above as:

This awesome framework takes care of everything: no brains required!

Like most things computer related and especially in programming: it’s still your responsibility to explain to the computer what to do. It can’t guess, and if you expect it to, you may run into slight frustrations…

Streams & Parallelism – Performance May Need To Be Requested

A not untypical scenario that benefits from streams might well be related to Kafka: I ran into this first little trap while learning both about zio-streams and zio-kafka.

Getting started is easy as pie, and because ZIO is awesome and stuff you think that if it works in your little dev environment, it will seamlessly scale in production.

Enters your 100mb/sec production stream of ~200'000 messages each second, and suddenly your app can’t keep up even though its CPU usage seems desperately low.

Darn, What’s the most likely culprit? Absence Of Parallelism

Don’t forget about mapMPar and flatMapPar

In you excitement, you may actually have forgotten to explicitely request to do certain things in parallel: if you’ve expressed your logic using .map or .flatMap only on your stream, well, that particular stage of your processing pipeline is guaranteed to be run on a single fiber.

That’s how I got to scratch my head pretty hard in a situation where, given a few dozen kafka partitions for my fat topic, ZIO still failed to suck up the few CPU’s I threw at the task.

The solution was easy: replace the mapM/flatMap that where applying the business logic to the stream with mapMPar (or some variants). Suddenly the servers were blowing hot air again.

Grouping Streams & Deadlocks – Smarter Is Not Always Better

The second, fun little trap I got a leg into was related to the useage of groupBy and groupByKey.

It’s important to remember that streams are potentially infinite, and that this puts some constraints on how the plumbing of a framework like ZIO is implemeted.

Let me illustrate how this can have annoying consequences under certain circumstances:

My use case required to do some grouping of entries in a stream, using stream.groupByKey: that function will helpfully push entries with the same key to their own sub-stream, so you can do stuff with them at a later stage.

Being tentatively smart, I told myself let’s do the grouping here, return the stream and work on it at a later stage. Like so:

ZStream
      .fromIterable(collection)
      .groupByKey(_.key) {
        case (key, streamOfValues) =>
          ZStream.succeed(
            (key, streamOfValues)
          )
      }
      .flatMap { // Or possibly flatMapPar, because it seems so much smarter
        case (key, streamOfValues) =>
          ZStream.fromEffect(
            streamOfValues.map { unitunnels =>
              <do_some_expensive_stuff>
              )
            }.run(ZSink.collectAllToSet)
          )
      } 

Symptoms

This is all nice and good, but for situations where collection gets very big and has many different keys, you may actually run into a deadlock and all processing will stop.

How so? Because, although they can be infinite in theory, streams still need to be backed by something, and this something usually tends to be finite. For example, by default, groupBy will push entries belonging to the same key into sub-streams backed by a buffer of size 16.

And what happens when such a buffer is full and an attempt is made to add to it? That attempt will block, and in the case of our groupBy, distribution of entries to sub-streams will simply stop.

Meanwhile, you may have another downstream process that is waiting for said grouping to be done before proceeding with the consumption of all grouped streams. Result? You just put yourself into really nice deadlock.

Try To Consume Grouped Streams Right Away

The obvious (and obviously pedantic) answer here, of course, is don’t write anything that can lead to a deadlock. Duh. But that’s actually easier said than done, especially so when the hype leads you to believe that it’s safe and impossible to shoot yourself in the foot.

So here are some ideas around avoiding dead-locks with respect to grouping stream entries:

Here’s a link to the original Discord discussion if you’re curious.

Conclusion

ZIO is quite a good tool, but as for many other things in life, it does not dispense you from using your brain. Said differently: it’s always good to increase your understanding of what is happenning behind the curtains.

On another note: both issues described above were swiftly resolved with a few questions to ZIO’s Discord: if you’re blocked in any way, you’re very likely to find a friendly ear over there. Itamar Ravid and Adam Fraser where both very helpful.