
Weekly report (10/5)

Lab.work 2009. 10. 12. 22:33
For this week's report:

@ booksim 

With 25% hotspot traffic (the rest uniform random), the network saturates at injection rate = 0.05~0.06, 
and with 50% hotspot traffic it saturates at injection rate = 0.03. 

For the average latency of packets from src = 0,4,8,12 to dest != 0: 
VOQ works better than no-VOQ at the saturation point. 
The results at other injection rates seem meaningless. (voq_hotspot.xlsx) 
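
For reference, a minimal sketch of the kind of booksim config behind these runs. topology, k, n, routing_function, num_vcs, vc_buf_size, injection_rate, and sim_type are standard booksim parameters; the hotspot traffic line is an assumption about our modified build, not stock booksim syntax.

topology = mesh;
k = 8;                  // 8x8 mesh
n = 2;
routing_function = dor;
num_vcs = 8;
vc_buf_size = 8;
traffic = hotspot;      // assumption: our modified traffic option (25% or 50% of packets to the hot node)
injection_rate = 0.05;  // swept over 0.01~0.10 to find the saturation point
sim_type = latency;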

@ phoenix 
I added a page that explains the differences between Phoenix ver1.0 and ver2.0, 
and tested the actual processing time of the wordcount application with Ruby. 
Wordcount with 16 cores takes 2~3 hours to finish with Ruby (p2p default).



Weekly report (9/28)

Lab.work 2009. 9. 29. 00:36
Hi. 

First of all, I apologize for the late report every week :'( 
I will be more punctual next time. 

@ mapreduce ppt 

The PPT is almost done except for the summary part. 
I thought about what to write, but I think I need to discuss the new results first. 

I spent time explaining the MapReduce concept first, 
and compared the results in the paper with my test results (ver1.0 and ver2.0 on the sunT2 machine). 
Then I added the dynamic load balancing and execution time breakdown, 
and divided the test configurations into 4 parts: input size, thread distribution, cache size, and page size. 

Actually, I tested page size this week and found that 
when I changed the page size to 64K (the default is 8K), there was no degradation for wordcount. 

So I am now testing whether changing the page size 
makes the degradation disappear in the Simics results (where there was degradation after 16 nodes). 
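
As a sanity check, a tiny snippet to confirm the base page size in effect on the machine (sysconf is standard POSIX; this does not show the 64K switch itself, which may be a Solaris large-page setting or a simulator config):

#include <unistd.h>
#include <cstdio>

int main() {
    // Base VM page size; 8K is the default on UltraSPARC/Solaris.
    std::printf("page size: %ld bytes\n", sysconf(_SC_PAGESIZE));
    return 0;
}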

@ Booksim 

I printed the buffer occupancy for each cycle, and found that the buffers with VOQ saturated much faster than without VOQ. 
I think this is because the total number of packets with VOQ is larger than with no-VOQ. 

So I printed the latency of each run only at cycle = 5000 and 2000; 
then VOQ finally showed better results than no-VOQ. 
(I checked that VOQ is better than no-VOQ at the same cycle counts, 
but the final cycle count of VOQ is more than twice that of no-VOQ because of the number of packets.) 
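
The logging itself is simple; a sketch of the idea (RouterOccupancy and the field names are hypothetical stand-ins, not actual booksim members):

#include <cstdio>
#include <vector>

// Hypothetical stand-in for however the build exposes per-router buffer state.
struct RouterOccupancy { int used_slots; int total_slots; };

// Called once per simulated cycle: dump occupancy so the saturation onset
// (VOQ vs. no-VOQ) can be compared cycle by cycle.
void LogOccupancy(int cycle, const std::vector<RouterOccupancy>& routers, FILE* out) {
    for (std::size_t r = 0; r < routers.size(); ++r) {
        std::fprintf(out, "%d router=%zu occ=%d/%d\n",
                     cycle, r, routers[r].used_slots, routers[r].total_slots);
    }
}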

I will test with varying buffer sizes and compare the results. 

Thanks. 
minjeong


Progress Report (9/15)

Lab.work 2009. 9. 15. 15:24
■ THIS WEEK  (9/8~9/14)

□ CS 710 project

I compared the CPI of IDEAL, MESH, and PT_TO_PT,
where PT_TO_PT is the topology that GARNET offers by default (it is almost ideal),
MESH is the 4x4 mesh topology, 
and IDEAL uses the same link latency and config as MESH but is connected in a p2p manner.

For 3 applications (barnes, cont-ocean, noncont-ocean),
the CPI ordering is PT_TO_PT < IDEAL < MESH,
but the gap between them is not very big.
[mesh_vs_ideal.xlsx]

TODO: I think the network topology does not seriously affect performance for SPLASH-2 applications.
We might need to test SPECjbb or others and compare the results.

□ Booksim

I plotted a 3-D graph of accepted packets for each node, but couldn't find a specific pattern.
[booksim_hotspot.xlsx]

TODO: I want to fix the trafficmanager code to inject the same number of packets for VOQ and no-VOQ,
and then compare the results.
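
A sketch of the intended fix (my own illustration with hypothetical names, not actual booksim trafficmanager code): give every run the same fixed packet budget and stop injecting once it is spent, so VOQ and no-VOQ see identical offered load.

// Hypothetical injection gate for the trafficmanager (not booksim's real API).
struct InjectionBudget {
    long packets_left;  // same fixed budget for the VOQ and no-VOQ runs

    // Bernoulli injection, but capped by the shared budget.
    bool TryInject(double rate, double rand01) {
        if (packets_left <= 0 || rand01 >= rate) return false;
        --packets_left;
        return true;
    }
};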

□ Phoenix

I changed the wordcount_map function to count all words.
(Before, it only counted words consisting of a-z.)
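
A sketch of the change (my own illustration, not the actual Phoenix code; the real map function also emits each word to the intermediate queue, but that call is elided here since its exact signature is an assumption):

#include <cctype>
#include <cstdio>

// Before: only runs of a-z counted as words. After: any run of non-whitespace.
static bool is_word_char_old(char c) { return c >= 'a' && c <= 'z'; }
static bool is_word_char_new(char c) { return !std::isspace((unsigned char)c); }

static long count_words(const char* data, long len, bool (*is_word)(char)) {
    long words = 0, i = 0;
    while (i < len) {
        while (i < len && !is_word(data[i])) ++i;   // skip separators
        if (i < len) ++words;                       // start of a word
        while (i < len && is_word(data[i])) ++i;    // consume the word
    }
    return words;
}

int main() {
    const char text[] = "Hello, world 123 foo";
    long n = sizeof(text) - 1;
    // old = 3 (ello, world, foo): uppercase, digits, punctuation break words.
    // new = 4 (Hello,  world  123  foo): every non-whitespace run counts.
    std::printf("old=%ld new=%ld\n",
                count_words(text, n, is_word_char_old),
                count_words(text, n, is_word_char_new));
    return 0;
}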

The speedup pattern of the map phase is unchanged,
but the reduce time increases hugely.
From this result, the reduce processing time depends on the number of words.

Specifically, 
in the MapReduce version of wordcount, map threads store all words in the intermediate queue,
and after that, reduce threads read the words from the intermediate queue.
In the pthreads version, there is no intermediate queue between map and reduce, 
so every word passes through map and reduce in one step.
[test_on_gems.xlsx]

TODO: I think I need to check the size of the intermediate queue, 
and check the latency of storing and reading words.


Progress Report (9/8)

Lab.work 2009. 9. 8. 14:57
■ THIS WEEK  (9/1~9/7)

□ Booksim

I tested the same process with randperm traffic, but the results are almost the same.
Now I am trying to print out the hotspot latency and the other latency separately, and compare them.

□ Phoenix

* degradation of map time
I thought the reason for the degradation was disk I/O time.
For a shared memory system, the cache miss rate increases as the # of processors increases.

* test with the sunT2 stats
Using an 8KB private L1 data cache for each processor and a 4MB shared L2 cache,
the speedup degradation appears after 16 cores,
and the map time usually dominates the whole processing time.

* test with the perfect cache
Using a 100MB private L1 data cache and a 400MB shared L2 cache (other configs are the same as sunT2),
the L1 read hit rate of each processor is almost 99% (it's not 100% because of cold misses),
but the write hit rate slightly decreases as the number of processors increases.
The overall performance of the processors increases, but the speedup pattern is the same:
there is still degradation after 16 cores, and the map time is the dominant part.

I think I need to test other applications and compare the results,
and I will try to focus on the working set size of each application.


PROGRESS REPORT (9/1)

Lab.work 2009. 9. 1. 09:38
■ THIS WEEK  (8/27~9/1)

□ Booksim

* 50% of packets to dest 0, 50% uniform random
VOQ throughput never exceeds the no-VOQ result.
I counted the number of times a VC was full and no flit could be sent to it.
With VOQ, many flits have to wait for destination 0's VC.
Even when the buffer size is infinite, VOQ throughput is low,
because the waiting time at destination 0's VC is very long. 
□ Phoenix

* degradation of map time
I think the degradation is caused by disk I/O time.
For a shared memory system, the cache miss rate can increase as the # of processors increases.
When the input size gets bigger, the degradation can become larger.
Actually, the L1 cache miss rate increases after 8 threads.


PROGRESS REPORT (8/26)

Lab.work 2009. 8. 26. 10:17
■ THIS WEEK  (8/17~8/26)

□ Booksim

* single router
A single router has 5 input ports and 5 output ports, so it only has sources 0~4 and destinations 0~4.
The # of VCs does not affect the result when not using VOQ.

And when the # of VCs is 5, the result is exactly the same as with 64 VCs.
I think that because the single router has only 5 destinations (not 64), 
the result will be the same whenever # of VCs >= 5.
As you said, the # of VCs for VOQ should be equal to the # of ports.
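
In config terms this looks like the sketch below (num_vcs is a standard booksim parameter; treat the single-router topology name and the VOQ switch as assumptions about our local build):

topology = single;   // 5-port single router: sources/destinations 0~4 (name is an assumption)
num_vcs = 5;         // one VC per destination is enough for full VOQ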

* mesh88 uniform
I changed the trafficmanager to use VOQ for injection, so there is no blocking any more.
The overall throughput slightly increases, but the graph pattern remains the same.

□ Phoenix

The pthreads speedup is almost equal to the # of processors when the network size is < 24,
but the degradation rates are all different with 32 and 64 CPUs.

For proc = 1 and for some applications, MapReduce takes significantly more time than pthreads does.
I am verifying whether the MapReduce code works correctly.

=============================================================

■ NEXT WEEK 

□ Phoenix
Test all applications and verify whether the MapReduce code works correctly.

□ Booksim
Determine why VOQ (single router) does poorly on a multi-flit network.


PROGRESS REPORT (8/17)

Lab.work 2009. 8. 17. 15:49
■ THIS WEEK  (8/11~8/16)

□ Phoenix 2.0
I ran a simple test of wordcount on Simics.
There is still degradation after 64 nodes.
But the speedup increases up to ~250x (using 64 nodes).
[test_on_gems.pdf]

□ Booksim 2.0
I ran booksim with bimodal uniform traffic. 
Using 64 VCs, one per destination (8x8 mesh), the throughput is lower than the original.
When the buffer size is 1, the throughput decrease is huge.
As the buffer size gets bigger, the degradation gets smaller, but the throughput never exceeds the result that does not use destination info.
I'm not quite sure, but I think it is because of the under-utilization of VCs.
[booksim_bimodal_voq.pdf]

□ sigMP
I presented a paper at the sigMP seminar:
A Novel Cache Architecture with Enhanced Performance and Security (MICRO 2008)


■ NEXT WEEK  (8/17~8/23)

□ Phoenix 
Run all applications on GEMS with different configurations.

□ Booksim 2.0
Test more with different VOQ options.
Implement the output-queued router and analyze the results.


PROGRESS REPORT (8/10)

Lab.work 2009. 8. 11. 01:02
■ THIS WEEK  (8/4~8/10)

□ Phoenix 2.0
I ran Phoenix on GEMS (Simics + Ruby).
I just tested wordcount and pca.
Both applications still have a few assertion errors to fix, but they worked.
However, with Ruby, it takes too much time to watch the process.

□ Booksim 2.0
I changed the Bernoulli function and made a bimodal injection function.
The results are as below. 
(booksim_uniform_bimodal.pdf)
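
A sketch of the idea (my own illustration; it assumes "bimodal" means a mix of short and long packets, and the names are stand-ins rather than booksim's real injection-process interface):

#include <cstdlib>

// Returns the packet length to inject this cycle, or 0 for no injection.
int BimodalInjection(double rate, double short_frac, int short_len, int long_len) {
    double r = (double)std::rand() / RAND_MAX;       // Bernoulli: inject at all?
    if (r >= rate) return 0;
    double s = (double)std::rand() / RAND_MAX;       // pick one of the two modes
    return (s < short_frac) ? short_len : long_len;
}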

And I drew the traffic patterns correctly,
but we still need to look through the booksim code and analyze each traffic pattern.

□ sigMP
A Novel Cache Architecture with Enhanced Performance and Security (MICRO 2008)


■ NEXT WEEK  (8/11~8/17)

□ Phoenix 
Run all applications on GEMS with different configurations.

□ Booksim 2.0
Read the code in detail and understand the process. 
Test and compare the results, and analyze the characteristics of each traffic pattern.



PROGRESS REPORT (8/3)

Lab.work 2009. 8. 4. 10:27
■ THIS WEEK  (7/28~8/3)

□ Phoenix 2.0
I analyzed the histogram results.
The average task processing time increases as the # of cores increases,
but it does not increase as dramatically as wordcount's does.
(new_result_histogram.pdf)

□ Booksim 2.0
I read the book and drew each traffic pattern.
I'm not sure this is correct, but there are some strange things.
Bit complement (D_i = ~S_i) and Bit reverse (D_i = S_(b-i-1)) look exactly the same.
For Shuffle (D_i = S_((i-1) mod b)) and Bit rotation (D_i = S_((i+1) mod b)), only the traffic direction is different; they are inverse permutations of each other.
And when the mesh size is 4x4, transpose and neighbor are the same.
(traffic_patterns.jpg)

We are checking these observations against the results.

□ sigMP
No sigMP seminar this week.


■ NEXT WEEK  (8/4~8/10)

□ Phoenix 
Run a couple more applications and compare the results with wordcount.

□ Booksim 2.0
Test and compare the results, and analyze the characteristics of each traffic pattern.

□ sigMP
Jaehong asked me to cover his turn next week (8/14),
so please recommend a paper for my presentation.

■ Random traffic
Every source sends 1/N of its traffic to every destination.
Load balancing is very good (even a very bad topology or routing can look good under random traffic).

■ Permutation
Each source sends all of its traffic to a single destination.

   □ Bit permutation 

Destinations are written as binary(decimal) for each 4-bit source address:

Source | Bitcomp  | Bitrev   | Bitrotation | Shuffle  | Transpose
0000   | 1111(15) | 0000(0)  | 0000(0)     | 0000(0)  | 0000(0)
0001   | 1110(14) | 1000(8)  | 1000(8)     | 0010(2)  | 0100(4)
0010   | 1101(13) | 0100(4)  | 0001(1)     | 0100(4)  | 1000(8)
0011   | 1100(12) | 1100(12) | 1001(9)     | 0110(6)  | 1100(12)
0100   | 1011(11) | 0010(2)  | 0010(2)     | 1000(8)  | 0001(1)
0101   | 1010(10) | 1010(10) | 1010(10)    | 1010(10) | 0101(5)
0110   | 1001(9)  | 0110(6)  | 0011(3)     | 1100(12) | 1001(9)
0111   | 1000(8)  | 1110(14) | 1011(11)    | 1110(14) | 1101(13)
1000   | 0111(7)  | 0001(1)  | 0100(4)     | 0001(1)  | 0010(2)
1001   | 0110(6)  | 1001(9)  | 1100(12)    | 0011(3)  | 0110(6)
1010   | 0101(5)  | 0101(5)  | 0101(5)     | 0101(5)  | 1010(10)
1011   | 0100(4)  | 1101(13) | 1101(13)    | 0111(7)  | 1110(14)
1100   | 0011(3)  | 0011(3)  | 0110(6)     | 1001(9)  | 0011(3)
1101   | 0010(2)  | 1011(11) | 1110(14)    | 1011(11) | 0111(7)
1110   | 0001(1)  | 0111(7)  | 0111(7)     | 1101(13) | 1011(11)
1111   | 0000(0)  | 1111(15) | 1111(15)    | 1111(15) | 1111(15)


        ㅁ Bit complement : D_i = ~S_i
        ㅁ Bit reverse : D_i = S_(b-i-1) (b = 4 address bits for the 16 nodes of a 4x4 mesh)
           
[Figures: bitcomp, bitrev]


        ㅁ Bit rotation : D_i = S_((i+1) mod b)
        ㅁ Shuffle : D_i = S_((i-1) mod b)
        ㅁ Transpose : D_i = S_((i+b/2) mod b)
          
[Figures: bitrot, shuffle, transpose]



   □ Digit permutations
        ㅁ Tornado : D_x = (S_x + k/2 - 1) mod k (k = 4 for a 4x4 mesh)
             
[Figure: tornado]

        ㅁ Neighbor : D_x = (S_x + 1) mod k
             
[Figure: neighbor]
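
To double-check the table and figures, a small standalone program that recomputes every pattern for a 4x4 mesh from the formulas above (my own sketch, not booksim code):

#include <cstdio>

// N = 16 nodes: b = 4 address bits, k = 4 per dimension.
static int bit(int v, int i)           { return (v >> i) & 1; }
static int setbit(int v, int i, int x) { return v | (x << i); }

int main() {
    const int b = 4, k = 4;
    std::printf("src  comp rev  rot  shuf tran torn neig\n");
    for (int s = 0; s < (1 << b); ++s) {
        int comp = (~s) & 0xF;                               // D_i = ~S_i
        int rev = 0, rot = 0, shuf = 0, tran = 0;
        for (int i = 0; i < b; ++i) {
            rev  = setbit(rev,  i, bit(s, b - i - 1));       // D_i = S_(b-i-1)
            rot  = setbit(rot,  i, bit(s, (i + 1) % b));     // D_i = S_((i+1) mod b)
            shuf = setbit(shuf, i, bit(s, (i - 1 + b) % b)); // D_i = S_((i-1) mod b)
            tran = setbit(tran, i, bit(s, (i + b / 2) % b)); // D_i = S_((i+b/2) mod b)
        }
        // Digit permutations act on the radix-k digits (x, y) of the address.
        int x = s % k, y = s / k;
        int torn = ((x + k / 2 - 1) % k) + k * ((y + k / 2 - 1) % k); // tornado
        int neig = ((x + 1) % k)         + k * ((y + 1) % k);         // neighbor
        std::printf("%4d %4d %4d %4d %4d %4d %4d %4d\n",
                    s, comp, rev, rot, shuf, tran, torn, neig);
    }
    return 0;
}

Note that for k = 4 the tornado offset (k/2 - 1) is 1, so tornado and neighbor coincide here; that is the same kind of small-mesh coincidence as the transpose/neighbor observation above.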


