ALL THE SERVERS (Sponsored by SUGAR)

So, off to see the new Avengers tonight, super excited about it (probably due to the large amount of Pick & Mix and energy drinks I have already consumed!).

Before I go, I thought I would rebuild the Burf.co server, as the site has been running off my desktop computer for a week or so in preparation for the new server. At the same time my MacBook Pro has also been filled up with the CommonCrawl! The reason I took the server offline in the first place was that it ran MongoDB like a dog! Even running RAID 0, the full-text search was slower than my 69-year-old mum! (She is really slow, bless her).

So, the rebuild: I have scrapped the RAID 0 and put in an SSD. I am also running 2 instances of MongoDB on the same box. The server has 128GB of RAM now so it should be fine, however this time I want 2 distinct datasets without the power cost of running 2 servers (Yes, I know I can run stuff in the cloud, but look up the costs of 4TB of space).

One data set will live on the 4TB drive and will be the raw data from CommonCrawl before I have processed it.  The other dataset, which will live on the SSD, will be the processed data for the search engine.   The aim is to have a much smaller refined set of keywords for each page that will live in memory, and in hard times be read off the SSD.  This approach also means I can reprocess the data as many times as I like, plus switch out the full-text engine (2nd instance of Mongo) for Postgres without losing the raw web pages held in the main MongoDB.
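A minimal sketch of the raw-versus-processed split, assuming a raw page is just a URL plus its text. It boils a page down to its most frequent keywords, which is the kind of small record the second instance could hold. The field names, the tiny stop-word list and the `limit` are assumptions made for illustration, not the actual Burf.co schema, and no database is touched here.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "for", "on"}


def extract_keywords(text, limit=5):
    """Return the `limit` most frequent non-stop-words in `text`, lowercased."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    # Sort by count (descending), then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:limit]]


def process_page(raw_page, limit=5):
    """Turn a raw page record into the small record the search side would store."""
    return {"url": raw_page["url"], "keywords": extract_keywords(raw_page["text"], limit)}


# Made-up raw page, standing in for one record from the raw dataset.
raw = {"url": "http://example.com", "text": "Search engines index pages. A search engine ranks pages."}
processed = process_page(raw, limit=2)
print(processed)
```

Because the raw record is never modified, the same page can be run through a different `process_page` later, which is the whole point of keeping the two datasets apart.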

My original plan was to try and get between 1 and 5 million pages indexed, which was more than the original Burf.com ever did. The current solution is already at 7.7 million without breaking a sweat, and with the new solution I hope to hit 50 million!

I did plan to crawl the web manually before I discovered the CommonCrawl (and I may still do parts), so I bought a second-hand HP c7000 blade server (It’s a f@cking beast, and I can’t even lift it!). However, I think it’s going to be repurposed for some machine learning stuff across the larger dataset. I can’t let 16 * 4 * 2 cores go to waste even though it keeps my house warm!

 

c7000 blade server

 

So next steps for Burf.co

  • Move all the data from the other machines onto the new server and fire up the current Burf.co
  • Get 4TB of CommonCrawl web data and process it
  • Build a new search algorithm
  • Make the site sexy!

DeletedCity.net

My latest habit is to watch random documentaries about the history of the internet and the big players within that space. So, for example, the rise and fall of Yahoo, the deletion of Geocities and the history of Apple, Facebook, and Google. I find the best place to find these rather geeky documentaries is YouTube :). I wouldn’t usually blog about this, however, while watching the deletion of Geocities (damn you Yahoo), I learned that a rather unique and cool project has stemmed from it. There has been a lot of discussion lately about what should be done with old websites. Sites like the Wayback Machine archive them for people to view. If you think about it, Geocities is part of the history of the Internet: at its peak it contained over 38 million user-created sites! It appears that before Yahoo could delete Geocities, the Archive Team backed up as many of the sites as it could into a massive 641 GB torrent file (which I plan to download somehow).

So where am I going with this? Well, an artist called Richard Vijgen took the backup of Geocities and made an epic visualisation of it at Deletedcity.net. I recommend everyone go and have a look! The YouTube video I was watching can be found

 

This week’s update : It’s a lie : Metabase, more VEX and some data science!

So, let’s get the lie out of the way, this week’s update could cover more or less than a week!  It is whatever I am thinking of at the time, that may or may not be happening.  So apologies for that bombshell.

Software

So, at work (O2’s Innovation Lab) I am currently learning data science stuff. For anyone who knows me, this is an extremely hard task as I have the focus of a hamster on Red Bull. I am usually doing more than 1 thing (usually 5) and so it can be a struggle to learn a new skill, let alone one as difficult as data science. This week, I would say I am starting to get somewhere. I have been running different classifiers across my data, checking each one’s score and then looking at the confusion matrix. What that told me was that my data sucked badly, however, the upside was I could prove that my data was terrible.
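To show why the confusion matrix matters as well as the score, here is a minimal sketch in plain Python. The labels are made up (8 "no" rows, 2 "yes" rows) and the "classifier" is a stand-in that always answers "no"; none of this is the real work data or the classifiers actually used.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Rows are the true label, columns are the predicted label."""
    pairs = Counter(zip(actual, predicted))
    return [[pairs[(truth, guess)] for guess in labels] for truth in labels]


def accuracy(actual, predicted):
    """Fraction of predictions that match the true label."""
    correct = sum(1 for truth, guess in zip(actual, predicted) if truth == guess)
    return correct / len(actual)


# Made-up, imbalanced labels and a stand-in classifier that always says "no".
actual = ["no"] * 8 + ["yes"] * 2
predicted = ["no"] * 10

score = accuracy(actual, predicted)
matrix = confusion_matrix(actual, predicted, labels=["no", "yes"])

print(score)   # 0.8
print(matrix)  # [[8, 0], [2, 0]]
```

A score of 0.8 looks respectable on its own, but the second row of the matrix shows that not a single "yes" was ever predicted correctly. That is the sort of evidence that the data (or the model) is not up to the job.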

Another thing I am doing at work around data (oh look at my focus) needed me to take some data and put a GUI over the top so people can easily “ask the data questions”. I found a really cool free tool called Metabase which worked really nicely. All I needed to do was take an MS Access DB (oh boy, who uses MS Access), convert it to CSV, and chuck it in a Postgres DB. It would have taken 5 mins on a PC, on a Mac it took a little bit longer!
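The CSV-to-Postgres step can be sketched roughly like this, assuming the Access table has already been exported to a CSV file with a header row. The table name, file name and sample columns are hypothetical, every column is simply typed as TEXT, and the script only builds the statements rather than connecting to a database.

```python
import csv
import io


def create_table_sql(table, csv_text):
    """Build a CREATE TABLE statement with one TEXT column per CSV header field."""
    header = next(csv.reader(io.StringIO(csv_text)))
    # Quote identifiers so headers with spaces or capitals still work.
    columns = ", ".join('"{}" TEXT'.format(name.strip().replace('"', '""')) for name in header)
    return 'CREATE TABLE "{}" ({});'.format(table, columns)


def copy_command(table, csv_path):
    """Build the psql \\copy meta-command that loads the CSV, skipping the header row."""
    return "\\copy \"{}\" FROM '{}' WITH (FORMAT csv, HEADER true)".format(table, csv_path)


# Made-up sample standing in for the exported Access table.
sample = "Customer Name,Plan,Monthly Cost\nAlice,SIM only,10\n"
ddl = create_table_sql("customers", sample)
load = copy_command("customers", "customers.csv")

print(ddl)
print(load)
```

Both lines can then be pasted into a psql session. Once the table exists in Postgres, Metabase only needs to be pointed at that database.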

Robots

So what’s new on the robot front this week? Well, VEX Worlds is in less than 25 days and the software is erm…. still in development. The EDR Tank should be on its way to the US, so I made a mini version of it so that I can carry on with the development. I have written some safety features into the software so that I don’t mow down innocent kids. Mouthy kids will, of course, get run over! The next thing I need to do is finish the bridge between the VEX Cortex and the ROS software.

ROS

I have a new friend on Facebook (whoop whoop) who has been helping me with the ROS stuff. It’s useful to have a sounding board when learning new stuff, especially something as complex as ROS. I have a fear that the VEX Tank may not work too well with all the people moving about. SLAM and autonomous driving work (in a very simple form) by identifying features in the environment so the robot can locate itself. When you have no real features (e.g. a long corridor) or lots of things changing (e.g. people moving about), it can get very confused. I am sure robotics engineers have a good solution to this, but being a beginner and using Hector SLAM for the first time, I am not holding my breath. My mini Raspberry Pi / LEGO version got confused if I farted near it, let alone 10,000 kids running around!
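The corridor problem can be shown with a toy example. This is not how Hector SLAM works internally; it is a deliberately simplified 1D model, with made-up function names and numbers, where a "scan" is just the distance to the side walls plus the offsets to any features in range.

```python
def scan(x, features, wall_distance=1.0, sensor_range=4.0):
    """Toy 'laser scan' at position x along a corridor.

    Returns the (constant) distance to the side walls plus the offsets to any
    features, such as doorways, that are within sensor range.
    """
    visible = tuple(round(f - x, 3) for f in features if abs(f - x) <= sensor_range)
    return (wall_distance, visible)


def candidate_positions(observed, features, positions):
    """Every position whose expected scan matches what the robot observed."""
    return [p for p in positions if scan(p, features) == observed]


positions = [float(p) for p in range(10)]

# Featureless corridor: every position produces the same scan.
lost = candidate_positions(scan(3.0, []), [], positions)

# One doorway at x = 5.0: the scan now pins the robot down.
found = candidate_positions(scan(3.0, [5.0]), [5.0], positions)

print(len(lost))  # 10
print(found)      # [3.0]
```

With no features, all ten positions look identical, so the robot has no idea where along the corridor it is. One fixed doorway is enough to narrow it down to a single spot, which is also why moving people are a problem: they are "features" that do not stay put.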

Ending Notes

I started a statistics course as it’s the precursor to the Udacity Machine Learning Course.

I finished a sentiment analysis course, pretty interesting, it showed how to work out if a review of a film was positive or negative.

I watched Logan, it was very good and rather violent and definitely not for the kids.

I watched Kong, was pretty good but preferred the previous one, which to be fair is nothing like the new one.

I started printing the InMoov project 🙂 THE BEST 3D printed project in the world!