@ejolson
First I have to thank you again for insisting that calculating the same distance function again and again is faster than accessing the 18.3GB distance matrix. More on thanking further below. The explanation seems simple: accessing "D[a][b]" for the distance from city a to city b is most likely not in cache and has to be read from RAM. On the other hand, the "only" 100,000-element array "std::vector<coord_t> CC;" with "typedef std::pair<double, double> coord_t;" of 16 bytes per element is 1,600,000 bytes in total, and seems even to fit into the 2MB shared L3 cache of the Pi5. It definitely fits into the 64MB/35MB/45MB L3 caches of the AMD 7950X/Xeon 2680v4/Xeon 8880v3 CPUs.
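For illustration, here is the shape of the comparison as I understand it; this dist() is just a sketch of such a helper, not the actual code from the repository:

Code:
#include <cmath>
#include <utility>
#include <vector>

typedef std::pair<double, double> coord_t;

std::vector<coord_t> CC;  // 100,000 coordinates * 16 bytes = 1.6MB, L3-resident

// recompute on the fly: two loads that almost always hit the cache,
// plus a handful of FPU instructions
static inline int dist(const coord_t &a, const coord_t &b) {
    double dx = a.first - b.first, dy = a.second - b.second;
    return (int)(std::sqrt(dx * dx + dy * dy) + 0.5);  // nearest-int rounding
}

// versus the matrix lookup: a single load, but from an 18.3GB array that
// no cache can hold, so it nearly always goes all the way to RAM
// int d = D[a][b];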
HermannSW wrote:
In my opinion 32-bit indices rather than 64-bit pointers would lead to a more significant benefit.

ejolson wrote:
I switched 32-bit indices for the 64-bit pointers and ended up with a 3 percent slowdown when running on the Xeon. Scratchy hissed and refused on grounds that paradigmatic C should use pointers wherever possible.

It is unlikely to see advantages with 32-bit pointers on 64-bit CPUs. You would need to go back to a 32-bit PiOS to see advantages.
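For illustration, the two layouts that exchange is about, as hypothetical structs (not Scratchy's actual code): a doubly linked tour node holding 64-bit pointers versus one holding 32-bit indices into a single array:

Code:
#include <cstdint>

// paradigmatic C style: 64-bit pointers, 16 bytes per tour node
struct node_ptr {
    node_ptr *next;
    node_ptr *prev;
};

// 32-bit indices into one big array: 8 bytes per tour node, halving the
// footprint of the tour structure, but every access needs an extra
// base+index address computation
struct node_idx {
    uint32_t next;
    uint32_t prev;
};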
HermannSW wrote:
You cannot change the distance function to float, that is likely to return a different optimum, and the rules of mona-lisa100K.tsp need to be followed. Also your own random generator does not allow to compare apples to apples. I will try to change both and get your code to do the exact same computation.

ejolson wrote:
You are right, the code doesn't strictly follow the rules for the test case. It's possible rounding sqrt to the nearest integer could be done more quickly without a typecast, but I haven't tried.

Your changes are too big (different data structures, random generator, real instead of int distances), so I will not change your code. Instead I implemented your basic "dist(,) is better than D[][]" idea in my own code for comparison and found the same improvements that you reported.
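As an aside, a sketch of the two nearest-integer roundings mentioned in the exchange above; whether std::lround really beats the typecast is exactly the question that was left untried:

Code:
#include <cmath>

// rounding with a typecast: add 0.5, then truncate
static inline int dist_cast(double dx, double dy) {
    return (int)(std::sqrt(dx * dx + dy * dy) + 0.5);
}

// rounding without a typecast: std::lround rounds to nearest integer and
// often compiles down to a single rounding instruction
static inline long dist_lround(double dx, double dy) {
    return std::lround(std::sqrt(dx * dx + dy * dy));
}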
First you made me aware of the 4 memory channels of my Xeon CPUs, and I bought lots of RAM to complete the 2-socket / 8-socket systems with 128GB / 64GB RAM each, and now your investigation reveals that that memory is not needed ;-)

No problem for me, I will stop buying new RAM. But with the 192GB of now unused 1R RAM that I could use in one or two compute nodes of the 8-socket system, the 256GB Micron RAM from the 2-socket system, the 512GB Samsung RAM from the 8-socket system and 5 additional 16GB Micron modules, I will do a 1TB+16GB RAM experiment at some point in time. Linux does not need much RAM, so 16GB should suffice, and I will try to allocate a contiguous array of size exactly 1TB(!). The biggest arrays that I allocated contiguously so far were 64GB, for computing all solutions of 39-field 3-3-2-2 peg solitaire, which you can play here (allocates 64GB+16GB+1GB=81GB of storage on my personal website for the 3-3-2-2/French/English boards):
https://stamm-wilbrandt.de/en/#peg-solitaire
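For the 1TB attempt, a minimal sketch of how I would try to get one contiguous allocation, assuming Linux with enough RAM; MAP_NORESERVE avoids up-front swap reservation, and pages only materialize on first touch:

Code:
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

int main() {
    const size_t sz = 1ULL << 40;  // 2^40 bytes = 1TiB; adjust if 10^12 is meant
    void *p = mmap(nullptr, sz, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    printf("contiguous 1TB mapping at %p\n", p);
    munmap(p, sz);
    return 0;
}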
I created a copy all_.cpp of all.cpp and then made all the "D[a][b]" -> "dist(CC[a], CC[b])" changes with this commit:
https://github.com/Hermann-SW/RR/commit ... 8949e8187a
Now all_.cpp allows comparing what you proposed against using the distance matrix. And your approach wins, with similar factors on the AMD 7950X PC and the 2-socket server. I cannot test on the 8-socket system because it has been factoring RSA-140 for some hours ;-)

For the AMD 7950X CPU single threaded (at >5.5GHz), all_.cpp is 76293824 / 45356407 = 1.68× faster:
Code:
hermann@7950x:~/RR/tsp/pthread/all$ time ./all -s 1234 ../../../data/tsp/extra/mona-lisa100K-1
init_dist() [20852036us]
5757191 global minimum
0: 6187059
RR_all() [76293824us]

real    1m37.959s
user    1m32.512s
sys     0m5.438s
hermann@7950x:~/RR/tsp/pthread/all$

Code:
hermann@7950x:~/RR/tsp/pthread/all$ time ./all_ -s 1234 ../../../data/tsp/extra/mona-lisa100K-1
init_dist() [0us]
5757191 global minimum
0: 6187059
RR_all() [45356407us]

real    0m45.398s
user    0m45.391s
sys     0m0.004s
hermann@7950x:~/RR/tsp/pthread/all$

For the Xeon 2680v4 CPU single threaded, all_.cpp is 274323815 / 158094868 = 1.74× faster:

Code:
hermann@E5-2680v4:~/RR/tsp/pthread/all$ time ./all -s 1234 ../../../data/tsp/extra/mona-lisa100K-1
init_dist() [44559262us]
5757191 global minimum
0: 6187059
RR_all() [274323815us]

real    5m20.344s
user    5m7.393s
sys     0m12.802s
hermann@E5-2680v4:~/RR/tsp/pthread/all$

Code:
hermann@E5-2680v4:~/RR/tsp/pthread/all$ time ./all_ -s 1234 ../../../data/tsp/extra/mona-lisa100K-1
init_dist() [0us]
5757191 global minimum
0: 6187059
RR_all() [158094868us]

real    2m38.221s
user    2m38.173s
sys     0m0.011s
hermann@E5-2680v4:~/RR/tsp/pthread/all$

ejolson wrote:
Although it takes six times longer to finish, due to the seed used for the random number generator the answer is exactly the same. I wonder how the Pi 5 would fare.

Top did show this on the Pi5 during the computation, not much RAM needed:
Code:
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
52540 pi        20   0   16944  14416   2640 R 100.0   0.3   2:31.50 all_

My 4GB Pi5 is only 351027047 / 45356407 = 7.74× slower than the single threaded 5.5GHz AMD 7950X CPU, but also 7.5× cheaper than the AMD 7950X CPU alone:
Code:
pi@raspberrypi5:~/RR/tsp/pthread/all $ time ./all_ -s 1234 ../../../data/tsp/extra/mona-lisa100K-1
init_dist() [0us]
5757191 global minimum
0: 6187059
RR_all() [351027047us]

real    5m51.186s
user    5m50.979s
sys     0m0.060s
pi@raspberrypi5:~/RR/tsp/pthread/all $

Another thank you for finding that best known tour on the uwaterloo website!
I did not read far enough down on the mona lisa page. There are 8 previously best solutions there as well; now all 9 are here:
https://github.com/Hermann-SW/RR/tree/m ... /tsp/extra
Now all_.cpp with the "-i" option to read a tour can immediately verify the stated tour costs, in less than a second on a Pi5; here for the 2nd best known solution, which costs 8 units more:
Code:
pi@raspberrypi5:~/RR/tsp/pthread/all $ time ./all_ -i ../../../data/tsp/extra/monalisa_5757199.tour ../../../data/tsp/extra/mona-lisa100K-1
init_dist() [0us]
5757191 global minimum
0: 5757199
RR_all() [0us]

real    0m0.099s
user    0m0.076s
sys     0m0.004s
pi@raspberrypi5:~/RR/tsp/pthread/all $
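The check is cheap because it is just 100,000 dist() calls along the tour; a minimal sketch of what "-i" has to compute (my names, not the actual all_.cpp internals):

Code:
#include <cmath>
#include <cstdint>
#include <utility>
#include <vector>

typedef std::pair<double, double> coord_t;

// assumed rounding rule: Euclidean distance to nearest integer
static inline int64_t dist(const coord_t &a, const coord_t &b) {
    double dx = a.first - b.first, dy = a.second - b.second;
    return std::lround(std::sqrt(dx * dx + dy * dy));
}

// cost of a closed tour given as a permutation of city indices
int64_t tour_cost(const std::vector<coord_t> &CC, const std::vector<int> &tour) {
    int64_t cost = 0;
    for (size_t i = 0; i < tour.size(); ++i)
        cost += dist(CC[tour[i]], CC[tour[(i + 1) % tour.size()]]);
    return cost;  // 5757199 for the 2nd best known tour
}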
Regarding "thank you": after sitting in the same office and working together for more than 5 years, from 1995 until 2001, with my co-author of the paper, Gerhard Schrimpf, I contacted him two months ago, and he is in early retirement as well. We had numerous calls on Ruin and Recreate, and his macOS computer was the reason I had to switch the random generator to std::mt19937, because Apple botched random/drand48.

It is perfectly possible that the 5,757,191 mona lisa tour is optimal. Even if not, it is not clear whether Gerhard and I will be able to find a better tour. At least he can now easily work together with me on his Mac, after getting rid of the big distance matrices thanks to your insistence.
In case we are able to win the $1,000 prize money
https://www.math.uwaterloo.ca/tsp/data/ml/monalisa.html
I will ask you for your PayPal details for a share, because of all your help so far: Xeon, OpenMP/pthread, "dist(,) better than D[][]", ...
My todos are now:
- getting rid of the distance matrix
- pthread/OpenMP work
  - see how far the 16C/32T, 28C/56T and 144C/288T systems can speed up the single threaded computation (a first OpenMP sketch follows below)
- amdgpu work
  - see how 1/10 Vega20 AMD GPUs (with 3,840 cores each) can help to speed up the mona lisa 100,000 cities TSP computation
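For the pthread/OpenMP item, the kind of first step I have in mind, sketched as a parallel tour-cost reduction (the real work will be parallelizing the ruin-and-recreate moves themselves):

Code:
#include <cmath>
#include <cstdint>
#include <utility>
#include <vector>

typedef std::pair<double, double> coord_t;

static inline int64_t dist(const coord_t &a, const coord_t &b) {
    double dx = a.first - b.first, dy = a.second - b.second;
    return std::lround(std::sqrt(dx * dx + dy * dy));
}

// compile with -fopenmp; each thread sums a chunk of the tour,
// and the reduction combines the partial sums
int64_t tour_cost_omp(const std::vector<coord_t> &CC, const std::vector<int> &tour) {
    const int64_t n = (int64_t)tour.size();
    int64_t cost = 0;
    #pragma omp parallel for reduction(+ : cost)
    for (int64_t i = 0; i < n; ++i)
        cost += dist(CC[tour[i]], CC[tour[(i + 1) % n]]);
    return cost;
}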
Statistics: Posted by HermannSW — Sun Aug 31, 2025 4:54 pm