So one of the few plus points of the insomnia caused by the extreme diet for this bodybuilding show is that my mind gets very creative and forces me to start kicking off new ideas, projects, missions etc.!
So, if you saw my last post, I said I was gonna finish Hack24, fix Burf.co and sort the garage! So far, Burf.co is back up (but about to completely change), the garage is nearly finished being geared up as a robotics lab, and Hack24 has not moved. I do want to finish Hack24, but I don’t want to rush it, and I want to harness my energy on some crazy robotics ideas while my brain still works 🙂
So, the plan v2! Warning: it’s a little bonkers, even for me!
Build a backend set of machine learning APIs that Burf.co, mobile devices, and my robots can use to send and retrieve data. The idea is I could send it a question, a command or an image and it does some magic and responds (a rough sketch of what that API might look like follows the list below).
- Mobile devices would send images or speech-to-text input, and the API would return ImageNet classifications or answers to questions.
- Burf.co would become more of a knowledge base system using NLP to feed into other systems.
- There would also be a public-facing chatbot which would hopefully learn from all of this. I’m planning a proof-of-concept using AIML to test the waters.
- This would all somehow be brought together to add some usefulness to my future robotics projects (image classification, knowledge base, etc.).
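To make that a little more concrete, here’s a minimal sketch (in Java, as that’s the common language I’m leaning towards) of the kind of API shape I have in mind. The endpoint names, port and placeholder responses are all made up purely to illustrate the idea of one backend serving the site, the phones and the robots; it’s not how it will actually end up being built.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the shared ML backend: one HTTP service that Burf.co,
// mobile clients and the robots could all talk to. Endpoint names and responses
// are placeholders, not a real implementation.
public class BurfBackendSketch {

    public static void main(String[] args) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);

        // Mobile devices / robots would POST an image here and get a label back.
        server.createContext("/classify", exchange -> {
            byte[] imageBytes = exchange.getRequestBody().readAllBytes();
            // TODO: run the bytes through an ImageNet-style classifier.
            respond(exchange, "{\"label\":\"placeholder\",\"bytesReceived\":" + imageBytes.length + "}");
        });

        // Text questions (e.g. speech already converted to text on the phone) go here.
        server.createContext("/ask", exchange -> {
            String question = new String(exchange.getRequestBody().readAllBytes(), StandardCharsets.UTF_8);
            // TODO: look the question up in the Burf.co knowledge base / chatbot.
            // (No JSON escaping here; it's only a sketch.)
            respond(exchange, "{\"answer\":\"I don't know yet: " + question.trim() + "\"}");
        });

        server.start();
        System.out.println("Sketch backend listening on http://localhost:8080");
    }

    private static void respond(HttpExchange exchange, String json) throws IOException {
        byte[] body = json.getBytes(StandardCharsets.UTF_8);
        exchange.getResponseHeaders().add("Content-Type", "application/json");
        exchange.sendResponseHeaders(200, body.length);
        try (OutputStream out = exchange.getResponseBody()) {
            out.write(body);
        }
    }
}
```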
I bought some odd bits of hardware, upgraded the Burf.co server, bought some domains, and started rewriting Burf.co in Java. I decided I want to try and use a common language across everything, and somewhat randomly Java seemed the best fit (client, server, mobile etc.).
It’s gonna be slow progress, but I think it’s gonna be exciting.
Sorry that for the last few months there has been little in the way of updates. I decided that before I got too old I should enter a bodybuilding contest; honestly, it seemed like a good idea at the time. Well, it’s been the hardest 3 months of my life, to be honest. First off, it isn’t the cheapest thing to do (I have spent £500 on chicken alone), you have to be super disciplined (up at 5, 1 hour of cardio, training even if ill), and you become very moody, tired and even get insomnia (writing this at 3am) because you’re constantly hungry! However, with less than 2 weeks to go, I am still chuffed I decided to do it. On the positive side, because you don’t sleep as much, your brain seems to become very creative!!!
So what’s the plan, Batman?
So, I have many plans, lots of ideas, and have been researching lots of stuff, but I need to do some housekeeping first!
1) Release a v1 of Hack24 cross platform to prove the framework works.
2) Fix Burf.co Search Engine because I turned it off.
3) Finish the garage so that large projects are possible 🙂
All of the above are in progress and I hope to have them done very soon. Then it is full speed ahead for some cool robotics / machine learning projects that I will discuss in my next post 🙂
So I’ve been on holiday driving across America, and it was great fun! I was hoping to be inspired about what to do with Burf.co, what the business plan is, the focus, etc. Sadly, absolutely nothing came to mind! Not a dime! Technically I can now index a lot of records pretty fast, and I have enough hardware to heat up a street of houses (more than my ring main could handle), however I still don’t know what the point is. So for the moment, it’s being paused!
With that in mind, I thought I would carry on with Hack24 and see if I can get it out the door. I am still using libGDX; however, I have moved to Multi-OS Engine for iOS, and I am going to see if I can use Firebase for the backend.
Hopefully, by the time I have an MVP built, I should then know what I am going to do with Burf.co
So off to see the new Avengers tonight, super excited about it (probably due to a large amount of Pick & Mix and Energy drinks I have already consumed!)
Before I go, I thought I would rebuild the Burf.co server, as the site has been running off my desktop computer for a week or so in preparation for the new server. At the same time, my MacBook Pro has also been filled up with the CommonCrawl! The reason I took the site offline in the first place was that it ran MongoDB like a dog! Even running RAID 0, the full-text search was slower than my 69-year-old mum! (She is really slow, bless her.)
So, the rebuild: I have scrapped the RAID 0 and put in an SSD. I am also running 2 instances of MongoDB on the same box. The server has 128GB of RAM now, so it should be fine; however, this time I want 2 distinct datasets without the power cost of running 2 servers (yes, I know I can run stuff in the cloud, but look up the costs of 4TB of space).
One data set will live on the 4TB drive and will be the raw data from CommonCrawl before I have processed it. The other dataset, which will live on the SSD, will be the processed data for the search engine. The aim is to have a much smaller refined set of keywords for each page that will live in memory, and in hard times be read off the SSD. This approach also means I can reprocess the data as many times as I like, plus switch out the full-text engine (2nd instance of Mongo) for Postgres without losing the raw web pages held in the main MongoDB.
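To show what I mean, here’s a minimal sketch of the two-instance idea using the Mongo Java driver; the ports, database and collection names (27017/27018, burf_raw, burf_search) and the keyword extraction are all made up for illustration, and the real processing will be far more involved.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

// Rough sketch of the two-instance setup: one mongod holds the raw CommonCrawl
// pages on the 4TB drive, the other holds the processed search data on the SSD.
// Ports, database and collection names are placeholders.
public class TwoInstanceSketch {
    public static void main(String[] args) {
        // Instance 1: raw, unprocessed CommonCrawl pages (big, on the spinning disk).
        MongoClient rawClient = MongoClients.create("mongodb://localhost:27017");
        MongoCollection<Document> rawPages =
                rawClient.getDatabase("burf_raw").getCollection("pages");

        // Instance 2: refined per-page keyword documents (small, lives on the SSD).
        MongoClient searchClient = MongoClients.create("mongodb://localhost:27018");
        MongoCollection<Document> searchIndex =
                searchClient.getDatabase("burf_search").getCollection("keywords");

        // Reprocessing = read from the raw set, write a slimmed-down doc to the search set.
        for (Document raw : rawPages.find().limit(100)) {
            Document slim = new Document("url", raw.getString("url"))
                    .append("keywords", extractKeywords(raw.getString("html")));
            searchIndex.insertOne(slim);
        }

        rawClient.close();
        searchClient.close();
    }

    // Placeholder: the real keyword extraction would be much smarter than this.
    private static String extractKeywords(String html) {
        return html == null ? "" : html.replaceAll("<[^>]+>", " ").toLowerCase();
    }
}
```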
My original plan was to try and get between 1 and 5 million pages indexed, which is more than the original Burf.com ever did. The current solution is already at 7.7 million without breaking a sweat, and with the new solution I hope to hit 50 million!
I did plan to crawl the web manually before I discovered the CommonCrawl (and I may still do parts), so I bought a second-hand HP c7000 blade server (it’s a f@cking beast, and I can’t even lift it!). However, I think it’s going to be repurposed for some machine learning stuff across the larger dataset. I can’t let 16 * 4 * 2 cores go to waste, even though it keeps my house warm!
So, next steps for Burf.co:
- Move all the data from the other machines onto the new server and fire up the current Burf.co
- Get 4TB of CommonCrawl web data and process it
- Build a new search algorithm
- Make the site sexy!
I hate starting a post by apologising for not updating my blog, so I won’t do that!
I have been a bit busy with the new job I started 3 weeks ago, so most of my side projects have been paused! However, work on Burf.co has gone 2 steps forward, a couple to the left and then a couple of steps backwards, largely due to the awesome CommonCrawl.org having a huge part of the Internet crawled and open for anyone to use! They have petabytes of web data, and there are some really cool examples of how to use it, most involving a huge amount of cloud power!! I did ponder for quite a while how I would store so much data!!! I found an interesting Java project that scans the index of the CommonCrawl for interesting file types (https://github.com/centic9/CommonCrawlDocumentDownload).
I took this project, hacked it about a bit and changed it so that it would only return URLs whose MIME type is HTML and that had a response status of 200. This gave me around 50 million URLs to play with, all of which had file pointers to the actual web page data. Because this data is compressed, it’s far quicker to download the pages from the CommonCrawl than to actually scrape the websites themselves. CommonCrawl also respects robots.txt, which is far more than I have ever done :). The end result so far is that I can get around 5 million pages of data a day (from my home internet) compared to around 500k on a good day previously! That’s a pretty good increase!
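For illustration, here’s a rough sketch of that filtering idea, assuming a CommonCrawl CDX index file sitting on disk (the file name is made up, and the exact JSON field layout should be checked against the real index): keep only the records that look like HTML pages with a 200 status.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Sketch: walk a CommonCrawl CDX index file (one JSON record per line) and keep
// only entries that look like HTML pages returned with a 200 status. Each kept
// line carries the WARC file name, offset and length needed to fetch the page.
public class CdxFilterSketch {

    private static final Pattern HTML = Pattern.compile("\"mime\"\\s*:\\s*\"text/html\"");
    private static final Pattern OK_STATUS = Pattern.compile("\"status\"\\s*:\\s*\"200\"");

    public static void main(String[] args) throws IOException {
        List<String> htmlHits = new ArrayList<>();
        // "cdx-00000" is just an example file name for one shard of the index.
        try (BufferedReader reader = Files.newBufferedReader(Path.of("cdx-00000"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Crude regex checks instead of a proper JSON parser; fine for a sketch.
                if (HTML.matcher(line).find() && OK_STATUS.matcher(line).find()) {
                    htmlHits.add(line);
                }
            }
        }
        System.out.println("HTML pages with status 200: " + htmlHits.size());
    }
}
```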
My latest habit at the moment is to watch random documentaries about the history of the internet and the big players within that space: for example, the rise and fall of Yahoo, the deletion of GeoCities, and the history of Apple, Facebook, and Google. I find the best place to find these rather geeky documentaries is YouTube :). I wouldn’t usually blog about this; however, while watching the one on the deletion of GeoCities (damn you, Yahoo), I learned that a rather unique and cool project has stemmed from it. There have been a lot of discussions lately about what should be done with old websites. Sites like the Wayback Machine archive them for people to view, and if you think about it, GeoCities is part of the history of the Internet: at its peak it contained over 38 million user-created sites! It appears that before Yahoo could delete GeoCities, the Archive Team backed up as many of the sites as it could into a massive 641 GB torrent file (which I plan to download somehow).
So where am I going with this? Well, an artist called Richard Vijgen took the backup of GeoCities and made an epic visualisation of it at Deletedcity.net; I recommend everyone go and have a look! The YouTube video I was watching can be found here.
So, 3 years ago I joined O2’s innovation lab as their iOS developer. I didn’t actually do much hardcore iOS development per se; instead, I researched new technology and rapidly created prototypes to demonstrate to the wider business. It was awesome fun and I learned so much on so many different fronts (how big corporates work, AI/data science, VR/AR, the list goes on). As the lab developed, its future seemed to be focused on machine learning and big data, so I decided to move into the Digital Products team to lead the Android development for O2Drive, O2’s telematics solution. Though I miss the crazy lab work, I missed doing mobile development more. My Android knowledge was a little out of date, but I do enjoy a challenge and I learned a lot. I have to say that the O2Drive team was lovely and very welcoming! They really were driven to produce a great product!
So I start at Reach on the 9th of April as their Application Development Lead, which I am super excited about. I generally don’t change jobs very often (twice in 16 years) and don’t even have a CV! I believe they do a lot of Android and C# development, but I am hoping to shake things up a bit with some Swift! It would be nice to do some iOS, Android and maybe try out a cross-platform solution like Xamarin or React. I believe they have used Xamarin in the past 🙂
So work has been progressing nicely on Burf.co and it’s up to 2 million+ pages; however, one of my aims was to only index English content. I thought that by looking for the HTML “lang” attribute it would be really easy to do! Even if there were a few different variants like en-gb or en-us, I thought it would still be an easy task. I also thought that all popular/mainstream/big sites would use this attribute!
So, 2 million English pages later and, well… I have over 300 different lang=”*en*” variations! Plus, major sites like Wikipedia don’t even use the attribute!!
I guess it’s back to the drawing board!! I now need to use some sort of word-matching algorithm to look at the page content and work out whether it’s English! A simplistic way of doing this could be to search for very common English words (“the”, “a”, “at”, “there”, etc.) and see whether they exist.
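Something along these lines is what I have in mind; a minimal sketch, assuming a tiny made-up word list and a 10% threshold, both of which would need tuning against real pages:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch of the "very common English words" idea: strip the markup, count how many
// tokens are common English words, and call the page English above some threshold.
// The word list and the 10% threshold are guesses purely to illustrate the approach.
public class EnglishGuesser {

    private static final Set<String> COMMON_WORDS = new HashSet<>(Arrays.asList(
            "the", "a", "at", "there", "and", "of", "to", "in", "is", "it", "that", "for"));

    public static boolean looksEnglish(String html) {
        String text = html.replaceAll("<[^>]+>", " ").toLowerCase();
        String[] tokens = text.split("[^a-z']+");
        if (tokens.length == 0) return false;

        int hits = 0;
        for (String token : tokens) {
            if (COMMON_WORDS.contains(token)) hits++;
        }
        // If at least ~10% of the words are common English words, call it English.
        return (double) hits / tokens.length >= 0.10;
    }

    public static void main(String[] args) {
        System.out.println(looksEnglish("<p>The cat sat at the window and there was a bird in the garden</p>"));
        System.out.println(looksEnglish("<p>Le chat est assis à la fenêtre et regarde les oiseaux</p>"));
    }
}
```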
Update coming soon 🙂
So, the good news is that the Burf.co search engine is up and running in a rather alpha stage; the bad news is that it’s full of adult material, so please be warned!!! There is no safe-search filter on it yet!!!
My plan is to improve the search algorithm and then put an adult content blocker in place to protect people! This can of course be turned off, but trust me, there is some nasty stuff out there that no one wants to see!
Technology-wise, the index is just over a million pages and will increase heavily when the new blade server arrives tomorrow. It’s using MongoDB full-text search for the initial results. MongoDB did seem to sh@t the bed when indexing all the words on a page, so I only index parts of each page.
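As a rough illustration of that approach, here’s a sketch using the Mongo Java driver; the database, collection and field names and the 4KB cut-off are assumptions just to show the idea of indexing a slice of each page rather than the whole thing.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

// Sketch of the full-text setup: a text index over a truncated "content" field.
// Database, collection and field names are placeholders.
public class TextSearchSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> pages =
                    client.getDatabase("burf").getCollection("pages");

            // Full-text index over the (truncated) content field.
            pages.createIndex(Indexes.text("content"));

            // Store only the first few KB of text per page to keep the index manageable.
            String pageText = "...extracted page text...";
            String truncated = pageText.substring(0, Math.min(pageText.length(), 4096));
            pages.insertOne(new Document("url", "https://example.com")
                    .append("content", truncated));

            // Initial search results straight from Mongo's text index.
            for (Document hit : pages.find(Filters.text("search engine"))) {
                System.out.println(hit.getString("url"));
            }
        }
    }
}
```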
More to come soon 🙂
So, it is exciting times! I have made some progress with TRTLExchange; however, due to things outside of my control, it’s been slower than expected. So I have turned my spare time to Burf.co, my new search engine project, and while there is no website for it yet (there will be by the weekend), the actual search technology (code) has come on leaps and bounds. Overnight it managed to index over 500,000 pages, which for a single server was pretty cool. It did get up to 1.3 million pages, but MongoDB has, erm, shit the bed (many, many times). This could be a hardware limit (hard drive speed) or some performance tuning I need to do; however, it gets to the point where I can’t even insert more records without timeouts. This concerns me quite a bit, as I have an HP blade server on the way, which should up the crawling rate by a factor of 8. I am going to try and give it one last go today; however, it’s taken 12 hours to delete the data from the DB (I did remove instead of drop 🙁 ). It has been a very interesting learning curve with MongoDB. I think unless some magic happens I am going to try out Postgres next.
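For anyone else learning MongoDB the hard way, the difference looks something like this (connection string and names are placeholders): removing documents one by one grinds through every record and its indexes, while dropping the collection throws the whole lot away almost instantly.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

// The lesson learned at 12 hours' cost: prefer drop() over a blanket remove/deleteMany
// when you actually want rid of everything. Names here are placeholders.
public class DropVsRemove {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> pages =
                    client.getDatabase("burf").getCollection("pages");

            // Slow path: deletes documents individually and updates indexes as it goes.
            // pages.deleteMany(new Document());

            // Fast path: discards the whole collection (and its indexes) in one go.
            pages.drop();
        }
    }
}
```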
Writing this blog post has also just raised the point that I was trying to learn Kotlin about a month ago (facepalm). Damn!