@ejolson
First I have to thank you again for insisting that calculating the same distance function again and again is faster than accessing the 18.3GB distance matrix. More on thanking further below. The explanation seems simple: accessing "D[a][b]" for the distance from city a to city b is most likely not in cache and has to be read from RAM. On the other hand, the "only" 100,000-element array "std::vector<coord_t> CC;" with "typedef std::pair<double, double> coord_t;" of 16 bytes per element is 1,600,000 bytes in total, and seems even to fit into the 2MB shared L3 cache of the Pi5. It definitely fits into the 64MB/35MB/45MB L3 caches of the AMD 7950X/Xeon 2680v4/Xeon 8880v3 CPUs.
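For illustration, here is the shape of the comparison as I understand it; this dist() is just a sketch of such a helper, not the actual code from the repository:

Code:
#include <cmath>
#include <utility>
#include <vector>

typedef std::pair<double, double> coord_t;

std::vector<coord_t> CC;  // 100,000 coordinates * 16 bytes = 1.6MB, L3-resident

// recompute on the fly: two loads that almost always hit the cache,
// plus a handful of FPU instructions
static inline int dist(const coord_t &a, const coord_t &b) {
    double dx = a.first - b.first, dy = a.second - b.second;
    return (int)(std::sqrt(dx * dx + dy * dy) + 0.5);  // nearest-int rounding
}

// versus the matrix lookup: a single load, but from an 18.3GB array that
// no cache can hold, so it nearly always goes all the way to RAM
// int d = D[a][b];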
HermannSW wrote:
In my opinion 32-bit indices rather than 64-bit pointers would lead to a more significant benefit.

ejolson wrote:
I switched 32-bit indices for the 64-bit pointers and ended up with a 3 percent slowdown when running on the Xeon. Scratchy hissed and refused on grounds that paradigmatic C should use pointers wherever possible.

It is unlikely to see advantages with 32-bit pointers on 64-bit CPUs. You would need to go back to a 32-bit PiOS to see advantages.
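For illustration, the two layouts that exchange is about, as hypothetical structs (not Scratchy's actual code): a doubly linked tour node holding 64-bit pointers versus one holding 32-bit indices into a single array:

Code:
#include <cstdint>

// paradigmatic C style: 64-bit pointers, 16 bytes per tour node
struct node_ptr {
    node_ptr *next;
    node_ptr *prev;
};

// 32-bit indices into one big array: 8 bytes per tour node, halving the
// footprint of the tour structure, but every access needs an extra
// base+index address computation
struct node_idx {
    uint32_t next;
    uint32_t prev;
};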
HermannSW wrote:
You cannot change the distance function to float, that is likely to return a different optimum, and the rules of mona-lisa100K.tsp need to be followed. Also your own random generator does not allow to compare apples to apples. I will try to change both and get your code to do the exact same computation.

ejolson wrote:
You are right, the code doesn't strictly follow the rules for the test case. It's possible rounding sqrt to the nearest integer could be done more quickly without a typecast, but I haven't tried.

Your changes are too big (different data structures, random generator, real instead of int distances), so I will not change your code. Instead I implemented your basic "dist(,) is better than D[][]" idea in my own code for comparison and found the same improvements that you reported.
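As an aside, a sketch of the two nearest-integer roundings mentioned in the exchange above; whether std::lround really beats the typecast is exactly the question that was left untried:

Code:
#include <cmath>

// rounding with a typecast: add 0.5, then truncate
static inline int dist_cast(double dx, double dy) {
    return (int)(std::sqrt(dx * dx + dy * dy) + 0.5);
}

// rounding without a typecast: std::lround rounds to nearest integer and
// often compiles down to a single rounding instruction
static inline long dist_lround(double dx, double dy) {
    return std::lround(std::sqrt(dx * dx + dy * dy));
}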
First you made me aware of the 4 memory channels of my Xeon CPUs, and I bought lots of RAM to complete the 2-socket / 8-socket systems with 128GB / 64GB RAM each, and now your investigation reveals that that memory is not needed ;-)

No problem for me, I will stop buying new RAM. But with the 192GB of now unused 1R RAM that I could use in one or two compute nodes of the 8-socket system, the 256GB Micron RAM from the 2-socket system, the 512GB Samsung RAM from the 8-socket system and 5 additional 16GB Micron modules, I will do a 1TB+16GB RAM experiment at some point in time. Linux does not need much RAM, so 16GB should suffice, and I will try to allocate a contiguous array of size exactly 1TB(!). The biggest arrays that I allocated contiguously so far were 64GB, for computing all solutions of 39-field 3-3-2-2 peg solitaire, which you can play here (allocates 64GB+16GB+1GB=81GB of storage on my personal website for the 3-3-2-2/French/English boards):
https://stamm-wilbrandt.de/en/#peg-solitaire
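For the 1TB attempt, a minimal sketch of how I would try to get one contiguous allocation, assuming Linux with enough RAM; MAP_NORESERVE avoids up-front swap reservation, and pages only materialize on first touch:

Code:
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

int main() {
    const size_t sz = 1ULL << 40;  // 2^40 bytes = 1TiB; adjust if 10^12 is meant
    void *p = mmap(nullptr, sz, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    printf("contiguous 1TB mapping at %p\n", p);
    munmap(p, sz);
    return 0;
}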
I created a copy all_.cpp of all.cpp and then made all the "D[a][b]" -> "dist(CC[a], CC[b])" changes with this commit:
https://github.com/Hermann-SW/RR/commit ... 8949e8187a
Now all_.cpp allows comparing what you proposed against using the distance matrix. And your approach wins, with similar factors on the AMD 7950X PC and the 2-socket server. I cannot test on the 8-socket system because it has been factoring RSA-140 for some hours ;-)

For the AMD 7950X CPU single threaded (at >5.5GHz), all_.cpp is 76293824 / 45356407 = 1.68× faster:
Code:
hermann@7950x:~/RR/tsp/pthread/all$ time ./all -s 1234 ../../../data/tsp/extra/mona-lisa100K-1
init_dist() [20852036us]
5757191 global minimum
0: 6187059
RR_all() [76293824us]

real    1m37.959s
user    1m32.512s
sys     0m5.438s
hermann@7950x:~/RR/tsp/pthread/all$

Code:
hermann@7950x:~/RR/tsp/pthread/all$ time ./all_ -s 1234 ../../../data/tsp/extra/mona-lisa100K-1
init_dist() [0us]
5757191 global minimum
0: 6187059
RR_all() [45356407us]

real    0m45.398s
user    0m45.391s
sys     0m0.004s
hermann@7950x:~/RR/tsp/pthread/all$

For the Xeon 2680v4 CPU single threaded, all_.cpp is 274323815 / 158094868 = 1.74× faster:

Code:
hermann@E5-2680v4:~/RR/tsp/pthread/all$ time ./all -s 1234 ../../../data/tsp/extra/mona-lisa100K-1
init_dist() [44559262us]
5757191 global minimum
0: 6187059
RR_all() [274323815us]

real    5m20.344s
user    5m7.393s
sys     0m12.802s
hermann@E5-2680v4:~/RR/tsp/pthread/all$

Code:
hermann@E5-2680v4:~/RR/tsp/pthread/all$ time ./all_ -s 1234 ../../../data/tsp/extra/mona-lisa100K-1
init_dist() [0us]
5757191 global minimum
0: 6187059
RR_all() [158094868us]

real    2m38.221s
user    2m38.173s
sys     0m0.011s
hermann@E5-2680v4:~/RR/tsp/pthread/all$

ejolson wrote:
Although it takes six times longer to finish, due to the seed used for the random number generator the answer is exactly the same. I wonder how the Pi 5 would fare.

Top did show this on the Pi5 during the computation, not much RAM needed:
Code:
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
52540 pi        20   0   16944  14416   2640 R 100.0   0.3   2:31.50 all_

My 4GB Pi5 is only 351027047 / 45356407 = 7.74× slower than the single threaded 5.5GHz AMD 7950X CPU, but also 7.5× cheaper than the AMD 7950X CPU alone:
Code:
pi@raspberrypi5:~/RR/tsp/pthread/all $ time ./all_ -s 1234 ../../../data/tsp/extra/mona-lisa100K-1
init_dist() [0us]
5757191 global minimum
0: 6187059
RR_all() [351027047us]

real    5m51.186s
user    5m50.979s
sys     0m0.060s
pi@raspberrypi5:~/RR/tsp/pthread/all $

Another thank you for finding that best known tour on the uwaterloo website!
I did not read far enough down on the mona lisa page. There are 8 previously best solutions there as well; now all 9 are here:
https://github.com/Hermann-SW/RR/tree/m ... /tsp/extra
Now all_.cpp with the "-i" option to read a tour can immediately verify the stated tour costs, in less than a second on a Pi5; here for the 2nd best known solution, which costs 8 units more:
Code:
pi@raspberrypi5:~/RR/tsp/pthread/all $ time ./all_ -i ../../../data/tsp/extra/monalisa_5757199.tour ../../../data/tsp/extra/mona-lisa100K-1
init_dist() [0us]
5757191 global minimum
0: 5757199
RR_all() [0us]

real    0m0.099s
user    0m0.076s
sys     0m0.004s
pi@raspberrypi5:~/RR/tsp/pthread/all $
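The check is cheap because it is just 100,000 dist() calls along the tour; a minimal sketch of what "-i" has to compute (my names, not the actual all_.cpp internals):

Code:
#include <cmath>
#include <cstdint>
#include <utility>
#include <vector>

typedef std::pair<double, double> coord_t;

// assumed rounding rule: Euclidean distance to nearest integer
static inline int64_t dist(const coord_t &a, const coord_t &b) {
    double dx = a.first - b.first, dy = a.second - b.second;
    return std::lround(std::sqrt(dx * dx + dy * dy));
}

// cost of a closed tour given as a permutation of city indices
int64_t tour_cost(const std::vector<coord_t> &CC, const std::vector<int> &tour) {
    int64_t cost = 0;
    for (size_t i = 0; i < tour.size(); ++i)
        cost += dist(CC[tour[i]], CC[tour[(i + 1) % tour.size()]]);
    return cost;  // 5757199 for the 2nd best known tour
}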
Regarding "thank you": after sitting in the same office and working together for more than 5 years, from 1995 until 2001, with my co-author of the paper, Gerhard Schrimpf, I contacted him two months ago, and he is in early retirement as well. We had numerous calls on Ruin and Recreate, and his macOS computer was the reason I had to switch the random generator to std::mt19937, because Apple botched random/drand48.

It is perfectly possible that the 5,757,191 mona lisa tour is optimal. Even if not, it is not clear whether Gerhard and I will be able to find a better tour. At least he can now easily work together with me on his Mac, after getting rid of the big distance matrices thanks to your insistence.
In case we are able to win the $1,000 prize money
https://www.math.uwaterloo.ca/tsp/data/ml/monalisa.html
I will ask you for your PayPal details for a share, because of all your help so far: Xeon, OpenMP/pthread, "dist(,) better than D[][]", ...
My todos are now:
- getting rid of the distance matrix
- pthread/OpenMP work
  - see how far the 16C/32T, 28C/56T and 144C/288T systems can speed up the single threaded computation (a first OpenMP sketch follows below)
- amdgpu work
  - see how 1/10 Vega20 AMD GPUs (with 3,840 cores each) can help to speed up the mona lisa 100,000 cities TSP computation
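For the pthread/OpenMP item, the kind of first step I have in mind, sketched as a parallel tour-cost reduction (the real work will be parallelizing the ruin-and-recreate moves themselves):

Code:
#include <cmath>
#include <cstdint>
#include <utility>
#include <vector>

typedef std::pair<double, double> coord_t;

static inline int64_t dist(const coord_t &a, const coord_t &b) {
    double dx = a.first - b.first, dy = a.second - b.second;
    return std::lround(std::sqrt(dx * dx + dy * dy));
}

// compile with -fopenmp; each thread sums a chunk of the tour,
// and the reduction combines the partial sums
int64_t tour_cost_omp(const std::vector<coord_t> &CC, const std::vector<int> &tour) {
    const int64_t n = (int64_t)tour.size();
    int64_t cost = 0;
    #pragma omp parallel for reduction(+ : cost)
    for (int64_t i = 0; i < n; ++i)
        cost += dist(CC[tour[i]], CC[tour[(i + 1) % n]]);
    return cost;
}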
Statistics: Posted by HermannSW — Sun Aug 31, 2025 4:54 pm