Cosmic rays and computers (1998) (nature.com)
33 points by cfontes on Nov 27, 2020 | hide | past | favorite | 15 comments


At one time, Windows even dealt with gamma rays, as this assembly comment stated [1]:

  ;
  ; Invalidate the processor cache so that any stray gamma
  ; rays (I'm serious) that may have flipped cache bits
  ; while in S1 will be ignored.
  ;
  ; Honestly.  The processor manufacturer asked for this.
  ; I'm serious.
  ;
  invd
[1] For a brief period, the kernel tried to deal with gamma rays corrupting the processor cache (https://devblogs.microsoft.com/oldnewthing/20181120-00/?p=10...)


This issue is quite under-appreciated, in my experience. I work in a field where ECC memory is an absolute must because there's a lot of money on the line and cosmic rays are a real issue.

You'd be surprised by the amount of random bit flipping happening in commodity hardware that we use every day.


Is it known/assumed that most of them are caused by cosmic rays? I can imagine there could be much less exciting factors at play, like fluctuation in voltage, temporary local overheating or whatnot.


Cosmic rays aren't the only source of SEUs (single event upsets), but they're definitely a major one. One obvious data point: A very strong correlation between altitude and SEU rate.


What's the error rate nowadays?

In the early 2000s, what I could find published suggested an error rate somewhere in the range of 0.2-1 bit errors per GB per day.

I had a consumer desktop (no ECC) running as a home server, with plenty of spare memory, so I tried to verify that. I had a program that simply allocated 128 MB, wrote a pattern to it, and then every 60 seconds checked to see if any of the bits had changed.

At the low end of the expected error range, 0.2 bit errors per GB per day, my 128 MB pattern should have had an error on average every 40 days. Or every 80 days if the bit flipping only went one way because the pattern had an equal number of 0 bits and 1 bits.

I ran this for a couple of years and never spotted an error. I never did figure out if the published error rate was too high, at least for the kind of RAM I had, or if there was something about my local environment shielding my computer from the radiation that could have flipped bits.
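The check loop described above is easy to reproduce. Here's a minimal sketch (not the commenter's actual program, and a real test would allocate far more memory and pin it to keep the OS from paging it out): write a known pattern, then periodically rescan for bytes that no longer match.

```python
def make_buffer(n_bytes, pattern=0x55):
    # 0x55 = 01010101b, so the buffer holds an equal number of 0 and 1 bits,
    # as in the commenter's setup.
    return bytearray([pattern]) * n_bytes

def scan_for_flips(buf, pattern=0x55):
    # Return (index, value) for every byte that no longer matches the pattern.
    return [(i, b) for i, b in enumerate(buf) if b != pattern]

buf = make_buffer(1024)   # 128 MB in the original experiment; 1 KB for the demo
print(scan_for_flips(buf))   # no flips yet: []

buf[10] ^= 0x04              # simulate a single-bit upset in one byte
print(scan_for_flips(buf))   # [(10, 81)] -- byte 10 became 0x51
```

In practice you'd sleep between scans (the original used 60 seconds) and log any hits with a timestamp.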


It's hard for me to comment on your results as I know too little about your testing method.

I'm not allowed to give our numbers to the public but there's a few articles and blog posts about the probabilities. The general observation is:

"A system on Earth, at sea level, with 4 GB of RAM has a 96% chance of having a bit error in three days without ECC RAM. With ECC RAM, that goes down to 1.67e-10, or about one chance in six billion."

https://community.hiveeyes.org/t/soft-errors-caused-by-singl...
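As a sanity check on that quote: if you model bit flips as independent Poisson events, you can back out the per-GB daily error rate the 96%-in-three-days claim implies, and it lands inside the 0.2-1 errors/GB/day range mentioned upthread. (This is my own back-of-envelope model, not from the linked article.)

```python
import math

# Quoted claim: 4 GB of RAM, 3 days, 96% chance of at least one bit error.
p_error = 0.96
gb_days = 4 * 3

# Poisson model: P(at least one error) = 1 - exp(-rate * gb_days)
# Solve for the rate in errors per GB per day.
rate = -math.log(1 - p_error) / gb_days
print(round(rate, 3))   # ~0.27 errors/GB/day
```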


Are there good solutions for reliable computation by redundancy?


Yes, at various levels of the stack. One thing worth checking out is TI's "Hercules" processor family: a low-cost, entry-level, but very well implemented "safety processor." It uses a combination of ECC (not just on caches and main memory, but on peripheral FIFOs) and lockstep computation (running the same instruction through two separate cores in parallel, using them to cross-check each other) to convert malfunction (undefined, arbitrarily complex behavior in the case of a fault) into loss of function. At the system level, two such parts can then be used in a failover model, for example, without needing complex voting logic -- anyone who's sending you valid messages must be functional.

There are other lockstep parts out there, and alternatives to lockstep -- for example, triplex redundancy converts (single) malfunction into correct function. But Hercules is well documented, with interesting white papers, and definitely worth understanding as a starting point.
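The triplex redundancy mentioned above comes down to a 2-of-3 majority vote: run three copies of the computation and take, for each output bit, the value at least two copies agree on, so a single corrupted result is outvoted. A minimal sketch of such a voter (my illustration, not TI's implementation):

```python
def tmr_vote(a, b, c):
    # Bitwise 2-of-3 majority: each output bit is set iff at least two of
    # the three inputs have that bit set, masking a single faulty channel.
    return (a & b) | (a & c) | (b & c)

good = 0b1011
corrupted = good ^ 0b0100            # one bit flipped in one channel
print(bin(tmr_vote(good, good, corrupted)))   # 0b1011 -- the upset is masked
```

Note the vote only masks a fault in one channel; two simultaneous faults in the same bit position would win the vote, which is why real systems also monitor and repair disagreeing channels.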


Thank you!


Oh, this reminds me of the Radiolab episode about cosmic rays from last year. Fascinating story!

https://www.wnycstudios.org/podcasts/radiolab/articles/bit-f...


Interesting presentation on real world issues with cosmic rays: http://webhost.laas.fr/TSF/IFIPWG/Workshops&Meetings/44/W2/0...

Page 2 notes the somewhat famous set of issues in the year 2000 with some Sun Sparc models that did not have ECC on the L2 cache. It took out some pretty big players.


This comes up often when talking about ECC.

Does a Faraday cage not work against high velocity charged particles?


Cosmic rays are not just photons. A Faraday cage is meant to block electromagnetic fields; it won't do anything to stop leptons.

In my university labs I remember an experiment where we detected cosmic-ray muons that traveled through a series of stacked lead plates. You didn't have to wait long before recording an event. These particles are energetic and penetrating, and you need quite a lot of shielding to block them.


A Faraday cage will also do nothing to stop an X-ray or gamma photon.


No, Faraday cages do not protect you from high-speed charged particles. For that you need meters of water equivalent.



