So one of the only plus points of the insomnia brought on by the extreme diet for this bodybuilding show is that my mind gets very creative and forces me to start kicking off new ideas, projects, missions etc!
So, if you saw my last post, I said I was gonna finish Hack24, fix Burf.co and sort the garage! So far, Burf.co is back up (but about to completely change), the garage is nearly finished being kitted out as a robotics lab, and Hack24 has not moved. I do want to finish Hack24, but I don’t want to rush it, and I want to focus my energy on some crazy robotics ideas while my brain still works 🙂
So, the plan v2! Warning: it’s a little bonkers, even for me!
Build a backend set of machine learning APIs that Burf.co, mobile devices, and my robots use to send and retrieve data. The idea is I could send it a question, a command or an image, and it does some magic and responds.
- So for mobile devices, they would send images and text to speech, and it would return an ImageNet classification or answers to questions.
- Burf.co would become more of a knowledge base system using NLP to feed into other systems.
- There would also be a public-facing chatbot which would hopefully learn from all of this. I am planning a proof of concept using AIML to test the waters.
- This would all somehow also be brought together to add some usefulness to my future robotics projects (image classification, knowledge base, etc).
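A minimal sketch of how the "send it a question, a command or an image" idea might look as a single entry point. It is written in Python purely for brevity (the rewrite itself is in Java), and everything in it is an assumption for illustration: the `Request` shape, the handler names and the canned replies are hypothetical stubs, not real models or any actual Burf.co API.

```python
from dataclasses import dataclass
from typing import Callable, Dict


# Hypothetical request shape: these names are made up for this sketch.
@dataclass
class Request:
    kind: str     # "question", "command" or "image"
    payload: str  # the text, or an identifier/path for an image


def answer_question(payload: str) -> str:
    # Stub: a real version would query the knowledge base.
    return f"answer to: {payload}"


def run_command(payload: str) -> str:
    # Stub: a real version would forward the command to a robot.
    return f"executed: {payload}"


def classify_image(payload: str) -> str:
    # Stub: a real version would run an image classifier.
    return f"label for: {payload}"


# One table maps each kind of request to the function that handles it,
# so the website, the mobile apps and the robots can share one entry point.
HANDLERS: Dict[str, Callable[[str], str]] = {
    "question": answer_question,
    "command": run_command,
    "image": classify_image,
}


def handle(request: Request) -> str:
    """Route a request to its handler, or report that the kind is unknown."""
    handler = HANDLERS.get(request.kind)
    if handler is None:
        return f"unsupported request kind: {request.kind}"
    return handler(request.payload)
```

Calling `handle(Request("image", "cat.jpg"))` goes to the image stub, and adding a new capability (say, the chatbot) would just mean registering one more entry in `HANDLERS`.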
I bought some odd bits of hardware, upgraded the Burf.co server, bought some domains, and started rewriting Burf.co in Java. I decided I want to try and use one common language everywhere and, somewhat randomly, Java seemed the best fit (client, server, mobile etc).
It’s gonna be slow progress but I think it’s gonna be exciting.
Sorry there has been little update over the last few months. I decided that before I got too old I should enter a bodybuilding contest; honestly, it seemed like a good idea at the time. Well, it’s been the hardest 3 months of my life, to be honest. First off, it isn’t the cheapest thing to do (I have spent £500 on chicken alone), you have to be super disciplined (up at 5, 1 hour of cardio, training even if ill), and you become very moody, tired and even get insomnia (writing this at 3am) because you’re constantly hungry! However, with less than 2 weeks to go, I am still chuffed I decided to do it. On the positive side, because you don’t sleep as much, your brain seems to become very creative!!!
So what’s the plan Batman?
So, I have many plans and lots of ideas, and have been researching lots of stuff, but I need to do some housekeeping first!
1) Release a v1 of Hack24 cross platform to prove the framework works.
2) Fix Burf.co Search Engine because I turned it off.
3) Finish the garage so that large projects are possible 🙂
All of the above are in progress and I hope to have them done very soon. Then it is full speed ahead for some cool robotics / machine learning project that I will discuss in my next post 🙂
So off to see the new Avengers tonight, super excited about it (probably due to a large amount of Pick & Mix and Energy drinks I have already consumed!)
Before I go, I thought I would rebuild the Burf.co server, as the site has been running off my desktop computer for a week or so in preparation for the new server. At the same time, my MacBook Pro has also been filled up with the CommonCrawl! The reason I took it offline in the first place was that it ran MongoDB like a dog! Even running RAID 0, the full-text search was slower than my 69-year-old mum! (She is really slow, bless her).
So, the rebuild: I have scrapped the RAID 0 and put in an SSD. I am also running 2 instances of MongoDB on the same box. The server has 128GB of RAM now so should be fine; however, this time I want 2 distinct datasets without the power cost of running 2 servers (yes, I know I can run stuff in the cloud, but look up the costs of 4TB of space).
One dataset will live on the 4TB drive and will be the raw data from CommonCrawl before I have processed it. The other dataset, which will live on the SSD, will be the processed data for the search engine. The aim is to have a much smaller, refined set of keywords for each page that will live in memory and, in hard times, be read off the SSD. This approach also means I can reprocess the data as many times as I like, plus switch out the full-text engine (2nd instance of Mongo) for Postgres without losing the raw web pages held in the main MongoDB.
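To make the raw-versus-processed split concrete, here is a rough sketch in Python of the "smaller refined set of keywords for each page" idea. It is only an illustration under stated assumptions: plain dictionaries stand in for the two MongoDB instances, keywords are picked by simple word frequency (the real algorithm is not described here), and the function names, stop-word list and example URLs are all made up.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "for", "on"}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def extract_keywords(text, limit=5):
    """Reduce a raw page to its most frequent non-trivial words."""
    counts = Counter(
        w for w in tokenize(text) if w not in STOP_WORDS and len(w) > 2
    )
    return [word for word, _ in counts.most_common(limit)]


def build_index(raw_pages, limit=5):
    """Build the small processed dataset: keyword -> set of page URLs.

    raw_pages stands in for the raw store and is never modified, so the
    index can be thrown away and rebuilt as often as needed.
    """
    index = defaultdict(set)
    for url, text in raw_pages.items():
        for keyword in extract_keywords(text, limit):
            index[keyword].add(url)
    return index


def search(index, query):
    """Return the URLs whose keyword sets contain every query term."""
    terms = [t for t in tokenize(query) if t not in STOP_WORDS]
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results
```

Because `build_index` only reads from the raw pages, changing the keyword logic or swapping the processed store for something else means rebuilding the index, not crawling or downloading anything again.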
My original plan was to try and get between 1 and 5 million pages indexed, which was more than the original Burf.com ever did. The current solution is already at 7.7 million without breaking a sweat, and with the new solution I hope to hit 50 million!
I did plan to crawl the web manually before I discovered the CommonCrawl (and I may still do parts), so I bought a second-hand HP c7000 blade server (it’s a f@cking beast, and I can’t even lift it!). However, I think it’s going to be repurposed for some machine learning stuff across the larger dataset. I can’t let 16 * 4 * 2 cores go to waste, even though it keeps my house warm!
So, next steps for Burf.co:
- Move all the data from the other machines onto the new server and fire up the current Burf.co
- Get 4TB of CommonCrawl web data and process it
- Build a new search algorithm
- Make the site sexy!