Hacker News | Kovah's comments

I so often wonder about the many new CLI tools whose primary selling point is their speed over other tools. Yet I personally have not encountered any case where a tool like jq feels incredibly slow and I would feel the urge to find something else. What do people do all day that existing tools are no longer enough? Or is it the kind of "my new terminal opens 107ms faster now, and I don't notice it, but I simply feel better because I know"?

I process TB-size ndjson files. I want to use jq to do some simple transformations between stages of the processing pipeline (e.g. rename a field), but it is so slow that I write a single-use node or rust script instead.
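For reference, that kind of single-use rename can stay a short streaming script. A Python sketch (the field names `old_name`/`new_name` are placeholders, not from the original comment):

```python
import json
import sys

def rename_field(record, old="old_name", new="new_name"):
    """Rename one top-level field, if present; leave other keys alone."""
    if old in record:
        record[new] = record.pop(old)
    return record

if __name__ == "__main__":
    # Stream line by line so memory stays flat regardless of file size;
    # usable in a pipeline as: ... | python rename.py | ...
    for line in sys.stdin:
        line = line.strip()
        if line:
            sys.stdout.write(json.dumps(rename_field(json.loads(line))) + "\n")
```

Because it touches each line exactly once and never builds the whole document in memory, throughput is bounded mostly by JSON parse speed.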

Now I'm really curious. What field are you in that ndjson files of that size are common?

I'm sure there are reasons against switching to something more efficient–we've all been there–I'm just surprised.


> Now I'm really curious. What field are you in that ndjson files of that size are common?

I'm not OP, but structured JSON logs can easily result in humongous ndjson files, even with a modest fleet of servers over a not-very-long period of time.


So what's the use case for keeping them in that format rather than something more easily indexed and queryable?

I'd probably just shove it all into Postgres, but even a multi terabyte SQLite database seems more reasonable.
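A minimal sketch of the SQLite approach, loading ndjson into a table for ad-hoc querying (the `logs` table name and single-column schema are invented for illustration; SQLite's JSON1 functions can then query inside each record):

```python
import sqlite3
import sys

def load_ndjson(lines, conn):
    """Load ndjson lines into a single-column table of raw JSON text."""
    conn.execute("CREATE TABLE IF NOT EXISTS logs (body TEXT)")
    conn.executemany(
        "INSERT INTO logs (body) VALUES (?)",
        ((line.strip(),) for line in lines if line.strip()),
    )
    conn.commit()

if __name__ == "__main__":
    # Usage: python load_logs.py logs.db < logs.ndjson
    with sqlite3.connect(sys.argv[1]) as conn:
        load_ndjson(sys.stdin, conn)
```

Once loaded, something like `SELECT json_extract(body, '$.level') FROM logs` gets you indexable, queryable access without leaving a single file on disk.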


Replying here because the other comment is too deeply nested to reply.

Even if it's once off, some people handle a lot of once-offs, that's exactly where you need good CLI tooling to support it.

Sure jq isn't exactly super slow, but I also have avoided it in pipelines where I just need faster throughput.

rg was insanely useful in a project I once took over that had about 5GB of source files, a lot of them auto-generated, and you needed to find stuff in there. People were using Notepad++ and waiting minutes for a query to find something in the haystack; rg returned results in seconds.


You make some good points. I've worked in support before, so I shouldn't have discounted how frequent "once-offs" can be.

The use case could be, for example, exactly that: processing an old trove of logs into something more easily indexed and queryable, and you might want to use jq as part of that processing pipeline.

Fair, but for a once-off thing performance isn't usually a major factor.

The comment I was replying to implied this was something more regular.

EDIT: why is this being downvoted? I didn't think I was rude. The person I responded to made a good point, I was just clarifying that it wasn't quite the situation I was asking about.


At scale, low performance can very easily mean "longer than the lifetime of the universe to execute." The question isn't how quickly something will get done, but whether it can be done at all.

Good point. I said it above, but I'll repeat it here: I shouldn't have discounted how frequent once-offs can be. I've worked in support before, so I really should've known better.

Certain people/businesses deal with one-off things every day. Even for something truly one-off, if one tool is too slow it might still be the difference between being able to do it once or not at all.

This reminds me of someone who wrote a regex tool that matches by compiling regexes (at runtime of the tool) via LLVM to native code.

You could probably do something similar for a faster jq.


I would love, _love_ to know more about your data formats, your tools, what the JSON looks like, basically as much as you're willing to share. :)

For about a month now I've been working on a suite of tools for dealing with JSON specifically written for the imagined audience of "for people who like CLIs or TUIs and have to deal with PILES AND PILES of JSON and care deeply about performance".

For me, I've been writing them just because it's an "itch". I like writing high performance/efficient software, and there's a few gaps that it bugged me they existed, that I knew I could fill.

I'm having fun and will be happy when I finish, regardless, but it would be so cool if it happened to solve a problem for someone else.


I maintain some tools for the videogame World of Warships. The developer has a file called GameParams.bin which is Python-pickled data (their scripting language is Python).

Working with this is pretty painful, so I convert the Pickled structure to other formats including JSON.

The prettified file has always been around ~500MB, but recently it expands to about 3GB, I think because they've added extra regional parameters.

The file inflates to a large size because Pickle refcounts objects for deduping, whereas obviously that’s lost in JSON.

I care about speed and tools not choking on the large inputs, so I use jaq for querying and instruct LLMs operating on the data to do the same.
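The Pickle-vs-JSON size gap described above is easy to demonstrate: pickle memoizes shared objects and stores each one only once, emitting back-references for repeats, while JSON has no such mechanism and writes every occurrence in full. A small illustrative sketch (the data is invented, not from GameParams.bin):

```python
import json
import pickle

shared = {"name": "ship_component", "params": list(range(100))}
data = [shared] * 50  # fifty references to the SAME dict object

# pickle: one full copy of the dict plus 49 tiny memo references
print(len(pickle.dumps(data)))

# JSON: fifty full copies, since object identity is lost on serialization
print(len(json.dumps(data)))
```

On a structure with heavy internal sharing, the JSON output comes out many times larger than the pickle, which matches the ~500MB-to-3GB inflation described.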


This isn't for you then

> The query language is deliberately less expressive than jq's. jsongrep is a search tool, not a transformation tool-- it finds values but doesn't compute new ones. There are no filters, no arithmetic, no string interpolation.

Mind my asking what sorts of TB JSON files you work with? Seems excessively immense.


> Uses jq for TB json files

> Hadoop: bro

> Spark: bro

> hive: bro

> data team: bro


made me remember this article

<https://adamdrake.com/command-line-tools-can-be-235x-faster-...>

  Command-line Tools can be 235x Faster than your Hadoop Cluster (2014)

  Conclusion: Hopefully this has illustrated some points about using and abusing tools like Hadoop for data processing tasks that can better be accomplished on a single machine with simple shell commands and tools.

This article is good for new programmers to understand why certain solutions are better at scale; there is no silver bullet. Also, this is from 2014, and the dataset is < 4GB, so there was no reason to use Hadoop.

The discussion we had here was involving TB of data, so I'm curious how this is faster with CLIs rather than parallel processing...


jq is very convenient, even if your files are more than 100GB. I often need to extract one field from huge JSON-lines files, and I just pipe them through jq to get results. It's slower, but implementing proper data processing would take more time.

More than 100GB can be 101GB, 500GB or 1TB+. I was speaking about 1TB+ files. I'm not sure you can get it faster unless you have a parallel processor.

are those tools known for their fast json parsers?

If we talk about TB or PB+ scales, then yes.

Oh, can you post some benchmarks? I didn't know that parser throughput per core would change with the amount of data like that.

Deal with really big log files, mostly.

If you work at a hyperscaler, service log volume borders on the insane, and while there is a whole pile of tooling around logs, often there's no real substitute for pulling a couple of terabytes locally and going to town on them.


> often there's no real substitute for pulling a couple of terabytes locally and going to town on them.

Fully agree. I already know the locations of the logs on-disk, and ripgrep - or at worst, grep with LC_ALL=C - is much, much faster than any aggregation tool.

If I need to compare different machines, or do complex projections, then sure, external tooling is probably easier. But for the case of “I know roughly when a problem occurred / a text pattern to match,” reading the local file is faster.


I'll write a one-off shell pipeline to inspect something on 10^5 servers - it will be sent to each of those servers and run once or a handful of times, and the results will be transmitted back and that's that. Kind of a map-reduce shell thing, for ops type tasks.

Sometimes those will actually need to process through a bunch of data unexpectedly.

Sometimes those will be run on a loop - once per second, N per minute (etc), and the results will be used to monitor a situation until a bug is fixed or a spike in load is resolved or a proper monitoring program/metric can be deployed.

Sometimes those are to investigate a pegged CPU, and the amortized lower runtime across all the tasks on the CPU is noticeable.

We run our machines hot, and part of the reason we can do that is being in the habit of choosing lower-cost (in cycles) tooling whenever we can. If I can spend a little time and effort learning a tool that saves a bunch of CPU in aggregate, it's a win. When the whole company does it, we can spend a lot less on hardware than it costs in engineer time to make these decisions.

Another way of putting it: it's a type of frugality (not cheapness, just spending wisely). If you save a dollar once, it's nothing. If you have a habit of saving a dollar every time the opportunity arises, it adds up quickly. By having a habit of choosing more performant tools, you're less likely to hit a case where you wish you had used more performant tools, and when the need for pure parsimony does arise, you're already practiced at it and it's less painful.


We parse JSON responses for dashboards, alerting, etc. Thousands of nodes, depending on the resolution of your monitoring you could see improvements here.

For people chewing through 50GB logs or piping JSON through cron jobs all day, a 2x speedup is measurable in wall time and cloud bill, not just terminal-brain nonsense. Most people won't care.

If jq is something you run a few times by hand, a "faster jq" is about as compelling as a faster toaster. A lot of these tools still get traction because speed is an easy pitch, and because some team hit one ugly bottleneck in CI or a data pipeline and decided the old tool was now unacceptable.


It's a simple loop:

- Someone likes tool X

- Figures that they can vibe-code an alternative

- Takes Rust for performance, or FAVORITE_LANG for cred

- Claude implements a small subset of the features

- Benchmarks the subset

- Claim win, profit on showcase

Note: this particular project doesn't have many visible tells, but there's a pattern of overdocumentation (a 17% comment-to-code ratio, a >1000-word README, Claude-like comment patterns), so it might be a guided process.

I still think that the project follows the "subset is faster than set" trend.


You don't know something is slow until you encounter a use case where the speed becomes noticeable. Then you see the slowness across the board. If you can notice that a command hasn't completed and you are able to fully process a thought about it, it's slow(er than your mind, ergo slow!).

Usually, a perceptive user/technical mind is able to tweak their usage of the tools around their limitations, but if you can find a tool that doesn't have those limitations, it feels far more superior.

The only place where ripgrep hasn't seeped into my workflow, for example, is after the pipe, and that's just out of (bad?) habit. So much so that sometimes I'll foolishly do rg "<term>" | grep <second filter>, then proceed to do a metaphorical facepalm in my mind. Let's see if jg can make me go jg <term> | jq <transformation> :)


Well grep is just better sometimes. Like you want to copy some lines and grep at the end of a pipeline is just easier than rg -N to suppress line numbers. Whatever works, no need to facepalm.

Not every use case of jq is a person using it interactively in their terminal, believe it or not.

If somebody needs performance, they probably shouldn't be calling out to a separate process for json of all things, no?

(Honestly, who even still writes shell scripts? Have a coding agent write the thing in a real scripting language at least; they aren't fazed by the boilerplate of constructing pipelines with python or whatever. I haven't written a shell script in over a year now.)


If you’re writing the script to be used by multiple people, or on multiple systems, or for CI runners, or in containers, etc. then there’s no guarantee of having Python (mostly for the container situation, but still), much less of its version. It’s far too easy to accidentally use a feature or syntax that you took for granted, because who would still be using 3.7 today, anyway? I say this from painful recent experience.

Plus, for any script that’s going to be fetching or posting anything over a network, the LLM will almost certainly want to include requests, so now you either have to deal with dependencies, or make it use urllib.

In contrast, there’s an extremely high likelihood of the environment having a POSIX-compatible interpreter, so as long as you don’t use bash-isms (or zsh-isms, etc.), the script will probably work. For network access, the odds of it having curl are also quite high, moreso (especially in containers) than Python.


If you're distributing the script to other people, then the benefit of using Python and getting things like high-quality argument parsing for free is even greater.

If millisecond-level performance is a main concern, you shouldn't use jq. Believe it or not.

Race between ripgrep and ugrep is entertaining.

Optimization = good

Prioritizing speed as an SEO pitch over supporting the same features/syntax (especially without an immediately prominent disclosure of these deficiencies) = marketing bullshit

A faster jq except it can't do what jq does... maybe I can use this as a pre-filter when necessary.


Speed is a quality in itself. We are so bogged down by slow stuff that we often ignore it and don't actively search for alternatives.

But every now and then a well-optimised tool/page comes along with instant feedback and is a real pleasure to use.

I think some people are more affected by that than others.

Obligatory https://m.xkcd.com/1205


I am not sure if it was simon or pg who might've quoted this, but I remember a quote to the effect that a two-order-of-magnitude change in speed (a quantity) is a huge qualitative change in and of itself.

Hi, creator of Cloudhiker here. Thanks for mentioning my site! Let me know if you have any questions, issues or ideas.


I'm really not into math and got really lost in the second half of "Adding points on a curve". Just don't understand what the author wants to tell me with the grouping and the role of the identity element, which is called infinity but is zero?

However, after looking at the next section and playing with the chart I immediately got the idea where the whole article is heading. Interesting to see how this works.


There is a slight bug in the interaction. When you set P=Q, or for example when P is at the top and Q at the bottom, the lines disappear.

Basically, you need the "infinite/zero" point to handle the situation where the line through your two points is completely perpendicular to the x-axis, i.e. it does not intersect the curve at a third point. So instead, it intersects this special "infinite" point.

And conceptually why you need this "infinite" point is that without it you can't add points together properly.

Say, for a counterargument, that instead of doing this "flip or mirror" across the x-axis (in the interaction it is the red dot appearing), the red dot just appears on the same side as the two points being added on the curve, without the flipping.

Then P1+P2 = Q instead of the flipped Q', and P2+Q = P1.

If you then try to add P1+P2+Q, you get either Q+Q or P1+P1 depending on whether you computed (P1+P2)+Q or P1+(P2+Q), and those are not equal.

So you need this red-dot flipping happening in the interaction. However, if you have the flipping, then P1+P2 = Q', which is the mirror flip of Q.

So Q'+Q needs to equal this special infinite/zero point to ensure associativity works.
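That group law, including the flip and the infinite/zero point, can be made concrete with a toy curve over a small prime field. A Python sketch (the curve y² = x³ + 2x + 3 mod 97 and the points below are arbitrary illustrative choices, nothing cryptographic; `None` plays the role of the point at infinity):

```python
# Toy elliptic-curve point addition over GF(97), with None standing in
# for the point at infinity (the group identity).
P_MOD = 97
A = 2  # curve: y^2 = x^3 + 2x + 3 (mod 97)

def ec_add(p1, p2):
    if p1 is None:  # O + P = P: infinity acts like zero
        return p2
    if p2 is None:
        return p1
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        # Vertical line: no third intersection with the curve,
        # so the sum is the special infinite/zero point.
        return None
    if p1 == p2:
        s = (3 * x1 * x1 + A) * pow(2 * y1, -1, P_MOD) % P_MOD  # tangent slope
    else:
        s = (y2 - y1) * pow(x2 - x1, -1, P_MOD) % P_MOD  # chord slope
    x3 = (s * s - x1 - x2) % P_MOD
    y3 = (s * (x1 - x3) - y1) % P_MOD  # the "flip" across the x-axis
    return (x3, y3)
```

With P = (0, 10) on this curve, `ec_add(P, (0, 87))` is the vertical-line case and returns None, and an associativity spot-check like `ec_add(ec_add(P, P), Q) == ec_add(P, ec_add(P, Q))` holds, which is exactly what the flip buys you.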


Just to toss on some info you might already know: the mention of grouping is related to group theory. [0] If a set satisfies those 3 axioms, there are some assertions you can build off of that are common to all groups, and having an identity element is one of them. It's weird that it's NOT zero, but in this case, infinity behaves LIKE zero. (Imagine going infinitely along the curve on the x-axis towards the open part of the curve, and therefore going infinitely up/down the y-axis. At some point, you essentially have a vertical line between the original point and your infinitely far away point, which points at the exact opposite side of the curve, which reflects back to the original point.) For natural numbers, zero is the identity, since X + 0 = X, in the same way P + infinitelyfarawaypoint = P in this set.

To use a dumb analogy, it's polymorphism where your interface is something like regular old natural numbers: as long as your class behaves like natural numbers in some key ways, you can pass them to any add()/subtract()/multiply() functions relying on that behaviour.

[0] https://en.wikipedia.org/wiki/Group_(mathematics)#Definition


Unfortunately, I can second this, both as a developer and a user. His, IMHO, childish behavior has ruined his image for me and is not a good lighthouse for the Fediverse itself. Also, as an OSS veteran myself, I find it extremely concerning that he keeps starting new projects, declines to get proper help and build up a maintainer team, and leaves older projects in the dust. Pixelfed is the one product he should perhaps focus on, yet it feels like the platform is in maintenance-only mode. Pixelfed is a wonderful addition to the Fediverse and deserves to be in good hands.

Maybe, and this is a very personal opinion, his product success and the Kickstarter campaign raising over 100k made him feel like he's better than everybody else. And one can see the effects.


The sad part is that Apple used to make somewhat stable, functional software. I started with the iPhone 3 and, a bit later, with Mac OS Snow Leopard. It all went wrong when Mr. Cook decided to serve the shareholders instead of focusing on Apple's core values; the software went downhill at incredible speed in just a few years. And moving out of the ecosystem is a painful, if not unbearable, task that barely anyone loves to do. At least I can't even think about moving back to Android.


I recently tested Swiftkey after Typewise was sadly abandoned. It's sooooo much better than the stock keyboard. Not only does the auto-correct work incredibly well (garbage like "witjoit" is correctly transformed to "without", which the Apple keyboard can't manage), Swiftkey also handles multi-language typing astonishingly well. Last but not least, I can customize it. I am also not signed in to my account, so no settings or anything else are stored on Microsoft servers.


I consider moving away from Github, but I need a solid CI solution, and ideally a container registry as well. Would totally pay for a solution that just works. Any good recommendations?


We can run a Forgejo instance for you, with Firecracker VM runners on bare metal, and we can also support it and provide an SLA. We run it internally and it is very solid, handling a whole lot of large CI/CD jobs (mostly Rust compilation).

The down side is that the starting price is kinda high, so the math probably only works out if you also have a number of other workloads to run on the same cluster. Or if you need to run a really huge Forgejo server!

I suspect my comment history will provide the best details and overview of what we do. We'll be offering the Firecracker runner back to the Forgejo community very soon in any case.

https://lithus.eu


You've got any docs for firecracker as forgejo runners?


Ping me an email, adam@ domain.

If you're interested I'll see about getting the PR created sooner rather than later.


I actually went through some of the issue/pr stuff for the forgejo project after I asked you. It seems like things are moving along nicely and you seem to have found a welcoming environment in their repo. I will keep an eye on that progress. Thanks very much. I do not have a pressing need but firecracker runners would be pretty awesome to have.



Awesome. Thank you for letting me know and good luck with the PR!


Long time GitLab fan myself. The platform itself is quite solid, and GitLab CI is extremely straightforward but allows for a lot of complexity if you need it. They have registries as well, though admittedly the permission stuff around them is a bit wonky. But it definitely works and integrates nicely when you use everything all in one!


Should our repos be responsible for CI in the first place? Seems like we keep losing the idea of simple tools to do specific jobs well (unix-like) and keep growing tools to be larger while attempting to do more things much less well (microsoft-like).


I think most large platforms eventually split the tools out because you indeed can get MUCH better CI/CD, ticket management, documentation, etc from dedicated platforms for each. However when you're just starting out the cognitive overhead and cost of signing up and connecting multiple services is a lot higher than using all the tools bundled (initially for free) with your repo.



Why this and not Garnix?


Lots of dedicated CI/CD out there that works well. CircleCI has worked for me


GitLab can be self-hosted with container-based CI, and the CE edition is fairly easy to set up.


CE is pretty good. The things you will miss, and which eventually made us pay:

* Mandatory code reviews

* Merge queue (merge train)

If you don't need those it's good.

Also it's written in Ruby so if you think you'll ever want to understand or modify the code then look elsewhere (probably Forgejo).


GitLab has all the things.


Gitea / Forgejo. It supports GitHub Actions.


GitLab, best CI I've ever used.


I have searched for a proper keyboard replacement. But there is not a single one that 1) works properly without major bugs, 2) adheres to privacy standards (because no way I am sending all keystrokes to Microslop) and 3) does not cost a fortune. Typewise came close, but seems abandoned now.

If anyone has a recommendation, please reply.


I really wish Apple would allow us to swap out the Finder with something else, so files open in that other app instead of the Finder. This works reasonably well on Windows, where I "replaced" the Explorer with Directory Opus.


This used to be possible, I remember that I replaced Finder with some other app many years ago. I strongly assume that this doesn't work any more, though.


Yeah. Path Finder was a common power user tool.

I recall you used to be able to flip some bit somewhere to allow you to Quit the Finder, but I assume that's disappeared inside the encrypted and signed partition where Apple keeps all the things us stupid users shouldn't be allowed to touch.

But even then, you'd want more than just that, as when you tell the OS to "Reveal" a file or open a folder, that's the association I'd want to be able to change.

Honestly I'd really prefer the Windows XP File Explorer to the pile of crap the Finder has turned into.


  > I recall you used to be able to flip some bit somewhere to allow you to Quit the Finder
you can still do this with a hidden preference via the command line:

  defaults write com.apple.finder QuitMenuItem -bool true; killall Finder
[0] https://www.defaults-write.com/adding-quit-option-to-os-x-fi...

  > But even then, you'd want more than just that, as when you tell the OS to "Reveal" a file or open a folder, that's the association I'd want to be able to change.
yep, that should just be a normal setting like default browser (one thing i like about linux nowadays)


TinkerTool is a nice GUI app for this setting and a few others, see https://www.bresink.com/osx/TinkerTool.html


oh wow, it's been a long time; i forgot about that one, i'll have to download that again


Ah yes, thank you for reminding me - of course, it was Path Finder! You could even have it respond to "Reveal". Not sure any more if it was by renaming Path Finder to /Applications/Finder, or by changing its Bundle id to com.apple.finder, or some other trick.


"of course, it was Path Finder! You could even have it respond to 'Reveal'"

This still works. I have been using macs since 1985 and have always hated the Finder. In the days of classic Mac OS, my go to for file management was a desk accessory called DiskTop, which was great. Super fast and easy to operate from the keyboard.

When I switched to OSX, I needed something better than the Finder, chose Path Finder, and have been using it ever since. I have my complaints about it but have not been able to find anything I like better.


Sorry for the shameless plug, but I built Cloudhiker (https://cloudhiker.net) exactly for this: exploring great websites.

