Channel: Raspberry Pi Forums

Off topic discussion • Re: A bit of Raspberry with 144C/288T

Super linear speedup is stuff of dreams.
I know, a (new) sequential algorithm could simulate the parallel, and result in only linear speedup.
According to Shy, super linear speedup means the serial algorithm did not consider cache blocking and other optimisations which happened as a side effect in the parallel code.
Good potential explanation. For the sequential run I used the same OpenMP build, forced onto a single core with numactl, so OpenMP evidently does not apply such optimizations on its own when executing with a single OMP thread.


I started with omp code for parallel best insert.
First sequential version here:
https://github.com/Hermann-SW/RR/blob/m ... rt_seq.cpp

It contains huge arrays at the bottom, out of view when editing, holding the mona-lisa100K coordinates and optimal tour.
It also contains the min/max values for all tested scenarios, which were determined with sequential code and are now asserted.
Only that way I can be sure that parallel code does the right thing (it does for max on initial commit, but not for min).

I implemented a doubly linked random-access list with pred[] and succ[] arrays.
For each of the 100,000 cities s:
  • that city is ruined
  • best (or worst) insert (value only) is determined
  • value is asserted against stored 100,000 min/max values
  • s is reinserted into doubly linked list for next iteration
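The loop body above can be sketched roughly as follows. This is a minimal illustration, not the code from the repository: the `Tour` struct, its member names, and the use of Euclidean distance with the usual insertion cost d(a,s) + d(s,b) - d(a,b) are all assumptions made for the example.

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Hypothetical sketch: a cyclic tour over cities 0..n-1 kept in two arrays,
// so any city can be unlinked or relinked in O(1) by its index.
struct Tour {
  std::vector<int> pred, succ;
  std::vector<double> x, y;

  Tour(const std::vector<double>& xs, const std::vector<double>& ys)
      : pred(xs.size()), succ(xs.size()), x(xs), y(ys) {
    const int n = static_cast<int>(xs.size());
    for (int i = 0; i < n; ++i) {
      pred[i] = (i + n - 1) % n;
      succ[i] = (i + 1) % n;
    }
  }

  // Assumed metric: plain Euclidean distance.
  double dist(int a, int b) const {
    return std::hypot(x[a] - x[b], y[a] - y[b]);
  }

  // Ruin: unlink city s. pred[s] and succ[s] keep their old values.
  void ruin(int s) {
    succ[pred[s]] = succ[s];
    pred[succ[s]] = pred[s];
  }

  // Reinsert city s between a and succ[a].
  void insert_after(int a, int s) {
    const int b = succ[a];
    pred[s] = a;
    succ[s] = b;
    succ[a] = s;
    pred[b] = s;
  }

  // Value only: smallest added length over all edges of the remaining tour.
  double best_insert_value(int s) const {
    assert(succ[pred[s]] != s);  // s must have been ruined first
    const int start = succ[s];   // old successor, still part of the tour
    double best = std::numeric_limits<double>::max();
    int a = start;
    do {
      const int b = succ[a];
      const double delta = dist(a, s) + dist(s, b) - dist(a, b);
      if (delta < best) best = delta;
      a = b;
    } while (a != start);
    return best;
  }
};
```

Because `ruin` leaves `pred[s]` untouched, calling `insert_after(pred[s], s)` afterwards restores the original tour for the next iteration.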
Then I created the new best_insert.cpp OpenMP code from it:
https://github.com/Hermann-SW/RR/blob/m ... insert.cpp

For the user-defined minimum reduction, OpenMP 4.0 is needed; this posting is worth a read:
https://stackoverflow.com/questions/282 ... y#28276420
With OpenMP 4.0 it's possible to use user-defined reductions. A user-defined minimum reduction can be defined like this ...
Luckily my Pi5's gcc 12.2 fully implements even OpenMP 4.5:

Code:

pi@raspberrypi5:~/RR/tsp/openmp $ gcc --version | head -1
gcc (Debian 12.2.0-14+deb12u1) 12.2.0
pi@raspberrypi5:~/RR/tsp/openmp $ 
https://gcc.gnu.org/onlinedocs/gcc-12.2 ... MP-4_002e5
The OpenMP 4.5 specification is fully supported.

The example code determines the minimal value and the position where the minimum occurs.
I only need the value, because I will sequentially run through the tour and take the first position with that value.
That is fast and guarantees equivalent execution to sequential code.
I changed the Compare struct for that.
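A value-only user-defined reduction along those lines might look like the sketch below. The `Compare` layout, the function names, and the flat `cost` vector are assumptions for illustration and are not taken from best_insert.cpp. Compiled without -fopenmp the pragmas are ignored and the code simply runs sequentially with the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical value-only variant of a Compare struct (no index member).
struct Compare {
  double val;
};

// User-defined reduction (OpenMP 4.0+): keep the smaller partial result.
#pragma omp declare reduction(minimum : Compare : \
    omp_out = (omp_in.val < omp_out.val ? omp_in : omp_out)) \
    initializer(omp_priv = Compare{std::numeric_limits<double>::max()})

// Parallel part: determine only the minimal value.
double min_value(const std::vector<double>& cost) {
  assert(!cost.empty());
  const std::size_t n = cost.size();
  Compare best{std::numeric_limits<double>::max()};
  #pragma omp parallel for reduction(minimum : best)
  for (std::size_t i = 0; i < n; ++i) {
    if (cost[i] < best.val) best.val = cost[i];
  }
  return best.val;
}

// Sequential part: first position holding that value, as a sequential
// scan for the minimum would report it. Returns cost.size() if absent.
std::size_t first_position(const std::vector<double>& cost, double value) {
  for (std::size_t i = 0; i < cost.size(); ++i) {
    if (cost[i] == value) return i;
  }
  return cost.size();
}
```

Taking a minimum involves no rounding, so the reduced value does not depend on how the iterations are split across threads; only the position would, which is why it is recovered in a separate sequential pass.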

Specifying -DDOMAX selects maximum determination (worst insert) instead of minimum determination (best insert).
In the initial commit that works, but the build without that define does not; debugging is needed.
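One common way to wire up such a compile-time switch is sketched here. The helper names `better` and `extreme_value` are hypothetical and the actual code may select min versus max differently.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: -DDOMAX flips the comparison at compile time.
#ifdef DOMAX
inline bool better(double a, double b) { return a > b; }  // worst insert
#else
inline bool better(double a, double b) { return a < b; }  // best insert
#endif

// Returns the maximum with -DDOMAX, the minimum without it.
double extreme_value(const std::vector<double>& deltas) {
  assert(!deltas.empty());
  double best = deltas[0];
  for (double d : deltas) {
    if (better(d, best)) best = d;
  }
  return best;
}
```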

I used cpplint for that code, but my 3GHz Pi5

Code:

pi@raspberrypi5:~/RR/tsp/openmp $ freq
min=cur=3000000=max
pi@raspberrypi5:~/RR/tsp/openmp $ 
does take a loooong time:

Code:

pi@raspberrypi5:~/RR/tsp/openmp $ time cpplint --filter=-legal/copyright best_insert.cpp
Done processing best_insert.cpp

real    2m52.923s
user    2m52.789s
sys     0m0.028s
pi@raspberrypi5:~/RR/tsp/openmp $ 
The reason is the three lines with the huge arrays at the end ;-)

Code:

pi@raspberrypi5:~/RR/tsp/openmp $ tail -n-5 best_insert.cpp | head -1 | wc --char
588918
pi@raspberrypi5:~/RR/tsp/openmp $ tail -n-3 best_insert.cpp | head -1 | wc --char
1269338
pi@raspberrypi5:~/RR/tsp/openmp $ tail -n-1 best_insert.cpp | head -1 | wc --char
1059982
pi@raspberrypi5:~/RR/tsp/openmp $

Since this is code running from L3 cache, I developed it on my Pi5 — nice.
CPU usage is 99% when run sequentially on a single one of its 4 cores, but 393%(!) when run on all 4:

Code:

pi@raspberrypi5:~/RR/tsp/openmp $ g++ -DDOMAX -O3 -Wall -Wextra -pedantic best_insert.cpp -fopenmp
pi@raspberrypi5:~/RR/tsp/openmp $ 
pi@raspberrypi5:~/RR/tsp/openmp $ OMP_PROC_BIND=true numactl -C 3 time ./a.out 
119.70user 0.01system 1:59.77elapsed 99%CPU (0avgtext+0avgdata 5808maxresident)k
0inputs+0outputs (0major+209minor)pagefaults 0swaps
pi@raspberrypi5:~/RR/tsp/openmp $ 
pi@raspberrypi5:~/RR/tsp/openmp $ OMP_PROC_BIND=true numactl -C 0-3 time ./a.out 
121.79user 0.04system 0:30.97elapsed 393%CPU (0avgtext+0avgdata 4816maxresident)k
0inputs+0outputs (1major+213minor)pagefaults 0swaps
pi@raspberrypi5:~/RR/tsp/openmp $ 
Since no assert fired, we know that the ruin and worst-insert value determination worked for each of the 100,000 cities.


Same code on my 16C/32T AMD 7950X CPU, here 99% versus 1598% CPU:

Code:

hermann@7950x:~/RR/tsp/openmp$ OMP_PROC_BIND=true numactl -C 0-15 time ./a.out
61.51user 0.03system 0:03.85elapsed 1598%CPU (0avgtext+0avgdata 5888maxresident)k
448inputs+0outputs (1major+532minor)pagefaults 0swaps
hermann@7950x:~/RR/tsp/openmp$ OMP_PROC_BIND=true numactl -C 15 time ./a.out
49.31user 0.00system 0:49.32elapsed 99%CPU (0avgtext+0avgdata 6912maxresident)k
0inputs+0outputs (0major+482minor)pagefaults 0swaps
hermann@7950x:~/RR/tsp/openmp$

In addition to minimum not working currently, I had to comment out an assert for DOMAX that works fine sequentially:
maximum_reduction.assert_commented_out.png
So debugging is needed.
But it is nice to see (linear) speedup even in a pure cache scenario.
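To put numbers on that, speedup and parallel efficiency can be computed from the elapsed times quoted in the runs above (119.77 s versus 30.97 s on the Pi5, 49.32 s versus 3.85 s on the 7950X). The `Scaling` struct and `scaling` function are hypothetical names for this small worked example; with these inputs it gives roughly 3.87x on 4 cores and roughly 12.8x on 16 cores, that is about 97% and about 80% efficiency.

```cpp
#include <cassert>

// Hypothetical helper for the arithmetic: speedup = T_serial / T_parallel,
// efficiency = speedup / number of cores used.
struct Scaling {
  double speedup;
  double efficiency;
};

Scaling scaling(double serial_s, double parallel_s, int cores) {
  assert(parallel_s > 0.0 && cores > 0);
  const double s = serial_s / parallel_s;
  return {s, s / cores};
}
```

For example, `scaling(119.77, 30.97, 4)` covers the Pi5 runs and `scaling(49.32, 3.85, 16)` the 7950X runs.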

Statistics: Posted by HermannSW — Sun Sep 21, 2025 10:19 pm


