Off topic discussion • Re: A bit of Raspberry with 144C/288T

Superlinear speedup is the stuff of dreams.

I know, a (new) sequential algorithm could simulate the parallel one, resulting in only linear speedup.

According to Shy, superlinear speedup means that the serial algorithm did not consider cache blocking and other optimisations that happened as a side effect in the parallel code.

Good potential explanation. For the sequential run I used the OpenMP code with numactl forcing it onto a single core, so OpenMP obviously does not do such optimizations by itself for single-thread execution.

I started with OpenMP code for parallel best insert.
The first, sequential version is here:
https://github.com/Hermann-SW/RR/blob/m ... rt_seq.cpp

It contains huge arrays at the bottom, out of view when editing, for the mona-lisa100K coordinates and the optimal tour.
It also contains the min/max values for all tested scenarios, which were determined with the sequential code and are now asserted.
Only that way can I be sure that the parallel code does the right thing (on the initial commit it does for max, but not for min).

I implemented a doubly linked random access list with pred[] and succ[] arrays.
For each of the 100,000 cities s:

- that city is ruined (removed from the tour)
- the best (or worst) insert value (value only) is determined
- that value is asserted against the stored 100,000 min/max values
- s is reinserted into the doubly linked list for the next iteration

Then I created the new OpenMP code best_insert.cpp from it:
https://github.com/Hermann-SW/RR/blob/m ... insert.cpp

For the (user-defined) minimum reduction, OpenMP 4.0 is needed; this posting is worth a read:
https://stackoverflow.com/questions/282 ... y#28276420

With OpenMP 4.0 it's possible to use user-defined reductions. A user-defined minimum reduction can be defined like this ...

Luckily, my Pi5's gcc 12.2 even fully implements OpenMP 4.5:

Code:

pi@raspberrypi5:~/RR/tsp/openmp $ gcc --version | head -1
gcc (Debian 12.2.0-14+deb12u1) 12.2.0
pi@raspberrypi5:~/RR/tsp/openmp $

https://gcc.gnu.org/onlinedocs/gcc-12.2 ... MP-4_002e5

The OpenMP 4.5 specification is fully supported.

The example code determines the minimal value and the position where the minimum occurs.
I only need the value, because I then run sequentially through the tour and take the first position with that value.
That is fast and guarantees execution equivalent to the sequential code.
I changed the Compare struct accordingly.

Compiling with -DDOMAX determines the maximum (worst insert) instead of the minimum (best insert).
In the initial commit that works, but without that define it does not; debugging is needed.

I used cpplint for that code, but my 3GHz Pi5

Code:

pi@raspberrypi5:~/RR/tsp/openmp $ freq
min=cur=3000000=max
pi@raspberrypi5:~/RR/tsp/openmp $

does take a loooong time:

Code:

pi@raspberrypi5:~/RR/tsp/openmp $ time cpplint --filter=-legal/copyright best_insert.cpp
Done processing best_insert.cpp

real    2m52.923s
user    2m52.789s
sys     0m0.028s
pi@raspberrypi5:~/RR/tsp/openmp $

The reason is the three lines with the huge arrays at the end ;-)

Code:

pi@raspberrypi5:~/RR/tsp/openmp $ tail -n-5 best_insert.cpp | head -1 | wc --char
588918
pi@raspberrypi5:~/RR/tsp/openmp $ tail -n-3 best_insert.cpp | head -1 | wc --char
1269338
pi@raspberrypi5:~/RR/tsp/openmp $ tail -n-1 best_insert.cpp | head -1 | wc --char
1059982
pi@raspberrypi5:~/RR/tsp/openmp $

Since this is code running from L3 cache, I developed it on my Pi5, which is nice.
CPU usage is 99% when run sequentially on a single one of its 4 cores, but 393%(!) when run on all 4:

Code:

pi@raspberrypi5:~/RR/tsp/openmp $ g++ -DDOMAX -O3 -Wall -Wextra -pedantic best_insert.cpp -fopenmp
pi@raspberrypi5:~/RR/tsp/openmp $ 
pi@raspberrypi5:~/RR/tsp/openmp $ OMP_PROC_BIND=true numactl -C 3 time ./a.out 
119.70user 0.01system 1:59.77elapsed 99%CPU (0avgtext+0avgdata 5808maxresident)k
0inputs+0outputs (0major+209minor)pagefaults 0swaps
pi@raspberrypi5:~/RR/tsp/openmp $ 
pi@raspberrypi5:~/RR/tsp/openmp $ OMP_PROC_BIND=true numactl -C 0-3 time ./a.out 
121.79user 0.04system 0:30.97elapsed 393%CPU (0avgtext+0avgdata 4816maxresident)k
0inputs+0outputs (1major+213minor)pagefaults 0swaps
pi@raspberrypi5:~/RR/tsp/openmp $

Since no assert fired, we know that ruin and worst insert value determination worked for each of the 100,000 cities.

The same code on my 16C/32T AMD 7950X CPU, here 99% versus 1598% CPU:

Code:

hermann@7950x:~/RR/tsp/openmp$ OMP_PROC_BIND=true numactl -C 0-15 time ./a.out
61.51user 0.03system 0:03.85elapsed 1598%CPU (0avgtext+0avgdata 5888maxresident)k
448inputs+0outputs (1major+532minor)pagefaults 0swaps
hermann@7950x:~/RR/tsp/openmp$ OMP_PROC_BIND=true numactl -C 15 time ./a.out
49.31user 0.00system 0:49.32elapsed 99%CPU (0avgtext+0avgdata 6912maxresident)k
0inputs+0outputs (0major+482minor)pagefaults 0swaps
hermann@7950x:~/RR/tsp/openmp$

In addition to the minimum currently not working, I had to comment out an assert for DOMAX that works fine sequentially, so debugging is needed.
But it is nice to see (linear) speedup even in a pure cache scenario.

Statistics: Posted by HermannSW — Sun Sep 21, 2025 10:19 pm
