An ISP’s mistake with fatal consequences for internet giant Google


Frequent internet outages are perfectly ordinary, something to be tolerated and even be grateful for, when you live in the Republic of Indonesia. In my experience at work, no matter how often we switched internet service providers, the connection kept dropping and reconnecting, even though the ISPs we used were top-tier players providing a dedicated link over fiber optic.

The causes of these outages vary: a fiber cable dug up by an excavator, a fiber cable crushed by a fallen tree, hardware failure, earthquakes, and so on. An extreme case just happened at my workplace (6 Nov 2012): our international link through the ISP MoraTeleMatika (Moratelindo) went down. The NOC team at Universitas Lampung immediately reported the loss of international connectivity to Moratel’s engineers. A few hours later the ISP responded by investigating the cause, and in the end they decided the fix was a soft reset of their router at Gedung Cyber, as described in the email they sent:

Dear Valued Customer,
We would like to inform you that the International link is now back to normal.

Details of the incident:
Day / Date                         : Tuesday, 6 November 2012
Start Time                         : 08.10 WIB
End Time                           : 09.10 WIB
Root Cause                         : Router problem at Cyber
Impact                             : International Down
Action                             : Reset Router at Cyber
Please re-check the link on your side.
That is all the information we can share. If you have further questions, you can contact our Hotline at 021-31998600 or email us at noc@moratelindo.co.id
Please ignore this email if your link was not impacted. Thank you for your attention and cooperation.

OK, I considered the matter closed once we were told the fix was a soft reset of the router. But this morning I was startled by an article posted by a member of the FreeBSD-Indonesia forum, linking to http://blog.cloudflare.com/why-google-went-offline-today-and-a-bit-about and http://arstechnica.com/information-technology/2012/11/how-an-indonesian-isp-took-down-the-mighty-google-for-30-minutes/, which reported that a fatal mistake by an Indonesian provider had made internet giant Google unreachable on 6 Nov. After reading carefully, I found that the ISP in question was Moratelindo. So the real reason the campus’s international link went down finally became clear: a routing leak on the provider’s side. What is regrettable is that the ISP only told us about the router soft reset and not what actually happened, although in the business world this kind of trick is understandable as a way to cover up weaknesses in their service.

It turns out Indonesia doesn’t only have leaking gas pipes, tax leaks, state budget (APBN) leaks, and project leaks; even ROUTING leaks, hehehe. At least there is an upside to this incident: INDONESIA is now better known around the world, having gone international by making internet giant Google unreachable for about half an hour.

EOF

Excerpt from the CloudFlare engineer’s post on their blog

Today, Google’s services experienced a limited outage for about 27 minutes over some portions of the Internet. The reason this happened dives into the deep, dark corners of networking. I’m a network engineer at CloudFlare and I played a small part in helping ensure Google came back online. Here’s a bit about what happened.

At around 6:24pm PST / 02:24 UTC (5 Nov. 2012 PST / 6 Nov. 2012 UTC), CloudFlare employees noticed that Google’s services were offline. We use Google Apps for things like email so when we can’t reach their servers the office notices quickly. I’m on the Network Engineering team so I jumped online to figure out if the problem was local to us or global.

Troubleshooting

I quickly realised that we were unable to resolve all of Google’s services (or even reach 8.8.8.8, Google’s public DNS server), so I started troubleshooting DNS.

$ dig +trace google.com

Here’s the response I got when I tried to reach any of Google.com’s name servers:

google.com.                172800        IN        NS        ns2.google.com.
google.com.                172800        IN        NS        ns1.google.com.
google.com.                172800        IN        NS        ns3.google.com.
google.com.                172800        IN        NS        ns4.google.com.
;; Received 164 bytes from 192.12.94.30#53(e.gtld-servers.net) in 152 ms

;; connection timed out; no servers could be reached

The fact that no servers could be reached means something was wrong. Specifically, it meant that from our office network we were unable to reach any of Google’s DNS servers.
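
The same reachability check can be sketched in Python. This is a minimal illustration and not part of the CloudFlare post: the helper names `build_query` and `server_answers` are made up here, and the code only tests whether a server replies to a plain UDP DNS query before a timeout, which is a much cruder check than `dig +trace`.

```python
import socket
import struct


def build_query(name: str, qtype: int = 1, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query packet (qtype 1 = A record, class IN)."""
    # Header: ID, flags (0x0100 = recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def server_answers(server: str, name: str, timeout: float = 2.0) -> bool:
    """Return True if `server` sends back a reply with a matching query ID in time."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable-network errors
            return False
    return reply[:2] == query[:2]
```

Calling `server_answers("8.8.8.8", "google.com")` during an incident like this one would be expected to return `False`, matching the "no servers could be reached" message from `dig`.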

I started to look at the network layer, to see if that’s where the problems lay.

PING 216.239.32.10 (216.239.32.10): 56 data bytes
Request timeout for icmp_seq 0
92 bytes from 1-1-15.edge2-eqx-sin.moratelindo.co.id (202.43.176.217): Time to live exceeded

That was curious. Normally, we shouldn’t be seeing an Indonesian ISP (Moratel) in the path to Google. I jumped on one of CloudFlare’s routers to check what was going on. Meanwhile, other reports from around the globe on Twitter suggested we weren’t the only ones seeing the problem.

Internet Routing

To understand what went wrong you need to understand a bit about how networking on the Internet works. The Internet is a collection of networks, known as “Autonomous Systems” (AS). Each network has a unique number to identify it, known as an AS number. CloudFlare’s AS number is 13335, Google’s is 15169. The networks are connected together by what is known as Border Gateway Protocol (BGP). BGP is the glue of the Internet, announcing what IP addresses belong to each network and establishing the routes from one AS to another. An Internet “route” is exactly what it sounds like: a path from an IP address on one AS to an IP address on another AS.

BGP is largely a trust-based system. Networks trust each other to say which IP addresses and other networks are behind them. When you send a packet or make a request across the network, your ISP connects to its upstream providers or peers and finds the shortest path from your ISP to the destination network.

Unfortunately, if a network starts announcing that a particular IP address or network is behind it when in fact it is not, and that network is trusted by its upstreams and peers, then packets can end up misrouted. That is what was happening here.
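
To make the trust problem concrete, here is a minimal Python sketch that is not from the CloudFlare post. It contrasts an upstream that accepts whatever a customer announces with one that checks announcements against a per-customer allow-list. The allow-list, the function names, and the prefix 203.0.113.0/24 (a documentation range, not Moratel’s real address space) are all assumptions for illustration; only the AS number 23947 and the prefix 8.8.8.0/24 come from the text.

```python
import ipaddress

# Hypothetical allow-list: prefixes an upstream expects from each customer AS.
# 203.0.113.0/24 is a documentation range used as a stand-in.
CUSTOMER_PREFIXES = {
    23947: [ipaddress.ip_network("203.0.113.0/24")],
}


def accept_on_trust(customer_as: int, prefix: str) -> bool:
    """Pure trust: every announcement from the neighbor is accepted."""
    return True


def accept_with_filter(customer_as: int, prefix: str) -> bool:
    """Accept only announcements inside the customer's registered prefixes."""
    announced = ipaddress.ip_network(prefix)
    return any(
        announced.subnet_of(allowed)
        for allowed in CUSTOMER_PREFIXES.get(customer_as, [])
    )
```

Under these assumptions, `accept_on_trust(23947, "8.8.8.0/24")` lets the Google prefix through, while `accept_with_filter(23947, "8.8.8.0/24")` rejects it because it is not in the customer’s list.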

I looked at the BGP routes for a Google IP address. The route traversed Moratel (23947), an Indonesian ISP. Given that I’m looking at the routing from California and Google is operating data centres not far from our office, packets should never be routed via Indonesia. The most likely cause was that Moratel was announcing a network that wasn’t actually behind them.

The BGP Route I saw at the time was:

tom@edge01.sfo01> show route 216.239.34.10                          

inet.0: 422168 destinations, 422168 routes (422154 active, 0 holddown, 14 hidden)
+ = Active Route, - = Last Active, * = Both

216.239.34.0/24    *[BGP/170] 00:15:47, MED 18, localpref 100
                      AS path: 4436 3491 23947 15169 I
                    > to 69.22.153.1 via ge-1/0/9.0

Looking at other routes, for example to Google’s Public DNS, it was also stuck routing down the same (incorrect) path:

tom@edge01.sfo01> show route 8.8.8.8 

inet.0: 422196 destinations, 422196 routes (422182 active, 0 holddown, 14 hidden)
+ = Active Route, - = Last Active, * = Both

8.8.8.0/24         *[BGP/170] 00:27:02, MED 18, localpref 100
                      AS path: 4436 3491 23947 15169 I
                    > to 69.22.153.1 via ge-1/0/9.0
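
The two route outputs above can also be checked mechanically. The following Python sketch, which is not from the CloudFlare post, pulls the AS path out of `show route`-style text and flags any path to Google (15169) that transits Moratel (23947). The function names and regular expressions are illustrative assumptions, tailored only to the output lines quoted above.

```python
import re

# Route lines copied from the two outputs quoted above.
ROUTE_OUTPUT = """\
216.239.34.0/24    *[BGP/170] 00:15:47, MED 18, localpref 100
                      AS path: 4436 3491 23947 15169 I
                    > to 69.22.153.1 via ge-1/0/9.0
8.8.8.0/24         *[BGP/170] 00:27:02, MED 18, localpref 100
                      AS path: 4436 3491 23947 15169 I
                    > to 69.22.153.1 via ge-1/0/9.0
"""

PREFIX_RE = re.compile(r"^(\d+\.\d+\.\d+\.\d+/\d+)\s")
AS_PATH_RE = re.compile(r"AS path:\s+([\d ]+)")


def parse_routes(text: str) -> dict:
    """Return {prefix: [AS numbers]} for each route found in the text."""
    routes = {}
    prefix = None
    for line in text.splitlines():
        match = PREFIX_RE.match(line)
        if match:
            prefix = match.group(1)
            continue
        match = AS_PATH_RE.search(line)
        if match and prefix is not None:
            routes[prefix] = [int(asn) for asn in match.group(1).split()]
    return routes


def transits(path: list, asn: int, origin: int) -> bool:
    """True if the path ends at `origin` and passes through `asn` on the way."""
    return bool(path) and path[-1] == origin and asn in path[:-1]


for prefix, path in parse_routes(ROUTE_OUTPUT).items():
    if transits(path, asn=23947, origin=15169):
        print(f"{prefix}: path to Google goes via AS23947 (Moratel): {path}")
```

Both prefixes are flagged here, since each path places 23947 immediately before the origin AS 15169.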

Route Leakage

Situations like this are referred to in the industry as “route leakage”, as the route has “leaked” past normal paths. This isn’t an unprecedented event. Google previously suffered a similar outage when Pakistan was allegedly trying to censor a video on YouTube and the National ISP of Pakistan null routed the service’s IP addresses. Unfortunately, they leaked the null route externally. Pakistan Telecom’s upstream provider, PCCW, trusted what Pakistan Telecom was sending them and the routes spread across the Internet. The effect was YouTube was knocked offline for around 2 hours.

The case today was similar. Someone at Moratel likely “fat fingered” an Internet route. PCCW, who was Moratel’s upstream provider, trusted the routes Moratel was sending to them. And, quickly, the bad routes spread. It is unlikely this was malicious, but rather a misconfiguration or an error evidencing some of the failings in the BGP trust model.

The Fix

The solution was to get Moratel to stop announcing the routes they shouldn’t be. A large part of being a network engineer, especially working at a large network like CloudFlare’s, is having relationships with other network engineers around the world. When I figured out the problem, I contacted a colleague at Moratel to let him know what was going on. He was able to fix the problem at around 2:50 UTC / 6:50pm PST. Around 3 minutes later, routing returned to normal and Google’s services came back online.

Looking at peering maps, I’d estimate the outage impacted around 3–5% of the Internet’s population. The heaviest impact will have been felt in Hong Kong, where PCCW is the incumbent provider. If you were in the area and unable to reach Google’s services around that time, now you know why.

Building a Better Internet

This all is a reminder about how the Internet is a system built on trust. Today’s incident shows that, even if you’re as big as Google, factors outside of your direct control can impact the ability of your customers to get to your site so it’s important to have a network engineering team that is watching routes and managing your connectivity around the clock. CloudFlare works every day to ensure our customers get the optimal possible routes. We look out for all the websites on our network to ensure that their traffic is always delivered as fast as possible. Just another day in our ongoing efforts to #savetheweb.

  1. gie
    November 7, 2012 at 8:14 am

    The question is, why did Google’s BGP prefer Moratel? The AS path shown above is 4 hops long. Surely Google has BGP peering over there in California, which would make the AS path shorter?

  2. November 13, 2012 at 11:38 am

    Hmm, maybe PCCW, as Moratel’s upstream, had the shortest AS hop count to Google.
