Tumbleweed and that damn CommonCrawl

I hate starting a post by apologising for not updating my blog, so I won’t do that!

I have been a bit busy with the new job I started 3 weeks ago, so most of my side projects have been paused!  However, work on Burf.co has gone two steps forward, a couple to the left and then a couple backwards, largely thanks to the awesome CommonCrawl.org, which has a huge part of the Internet crawled and open for anyone to use! They have petabytes of web data available, and there are some really cool examples of how to use it, most involving a huge amount of cloud power!! I did ponder for quite a while how I would store so much data!!! I found an interesting Java project that scans the CommonCrawl index for interesting file types (https://github.com/centic9/CommonCrawlDocumentDownload).

I took this project, hacked it about a bit and changed it so that it would only return URLs with a MIME type of HTML and a response status of 200. This gave me around 50 million URLs to play with, each with a file pointer to the actual web page data. Because this data is compressed, it’s far quicker to download pages from CommonCrawl than to scrape the websites themselves. CommonCrawl also respects robots.txt, which is far more than I have ever done :). The end result is that I can get around 5 million pages of data a day (from my home internet) compared to around 500k on a good day before!  That’s a pretty good increase!
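For the curious, the filtering step boils down to something like this. This is a rough Python sketch, not the actual Java code from the project; the field names follow the CommonCrawl CDX index format, and the sample lines are made up:

```python
import json

def filter_index_lines(lines):
    """Yield (url, filename, offset, length) for HTML pages with a 200 status.

    Each CDX index line is "<urlkey> <timestamp> <json>"; the JSON part
    carries the MIME type, HTTP status and a pointer (file, offset, length)
    into the compressed WARC archives.
    """
    for line in lines:
        # The JSON payload starts at the first '{'
        payload = line[line.index("{"):]
        record = json.loads(payload)
        if record.get("status") == "200" and record.get("mime") == "text/html":
            yield (record["url"], record["filename"],
                   int(record["offset"]), int(record["length"]))

# Example with two made-up index lines:
sample = [
    'com,example)/ 20180101000000 {"url": "http://example.com/", "mime": "text/html", "status": "200", "filename": "crawl/part-0.warc.gz", "offset": "123", "length": "456"}',
    'com,example)/logo 20180101000000 {"url": "http://example.com/logo.png", "mime": "image/png", "status": "200", "filename": "crawl/part-0.warc.gz", "offset": "999", "length": "111"}',
]
print(list(filter_index_lines(sample)))
```

The (filename, offset, length) triple is what makes the speed-up possible: you can issue a ranged download of just that slice of the compressed archive instead of fetching the page from the live web.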

DeletedCity.net

My latest habit at the moment is watching random documentaries about the history of the internet and the big players within that space: for example, the rise and fall of Yahoo, the deletion of Geocities, and the history of Apple, Facebook, and Google.  I find the best place to find these rather geeky documentaries is YouTube :).  I wouldn’t usually blog about this; however, while watching the deletion of Geocities (damn you Yahoo), I found that a rather unique and cool project has stemmed from it.  There has been a lot of discussion lately about what should be done with old websites.  Sites like the Wayback Machine archive them for people to view, and if you think about it, Geocities is part of the history of the Internet; at its peak it contained over 38 million user-created sites!  It appears that before Yahoo could delete Geocities, the Archive Team backed up as many of the sites as it could into a massive 641 GB torrent file (which I plan to download somehow).

So where am I going with this?  Well, an artist called Richard Vijgen took the backup of Geocities and made an epic visualisation of it at Deletedcity.net; I recommend everyone go and have a look!  The YouTube video I was watching can be found below.

 

Goodbye O2, Hello Reach

So 3 years ago, I joined O2’s innovation lab as their iOS developer. I didn’t actually do much hardcore iOS development per se; instead I researched new technology and rapidly created prototypes to demonstrate to the wider business. It was awesome fun and I learned so much on so many different fronts (how corporates work, AI/data science, VR/AR, the list goes on). As the lab developed, its future seemed to be focused on machine learning and big data, so I decided to move into the Digital Products team to lead the Android development for O2Drive, O2’s telematics solution. Though I miss the crazy lab work, I missed doing mobile development more. My Android skills were a little out of date, but I do enjoy a challenge and I learned a lot. I have to say that the O2Drive team was lovely and very welcoming! They really were driven to produce a great product!

So I start at Reach on the 9th of April as their Application Development Lead, which I am super excited about. I generally don’t change jobs very often (twice in 16 years) and don’t even have a CV! I believe they do a lot of Android and C# development, but I am hoping to shake things up a bit with some Swift! It would be nice to do some iOS, Android and maybe try out a cross-platform solution like Xamarin or React. I believe they have used Xamarin in the past 🙂

To Lang=“en” or not to Lang?

So work has been progressing nicely on Burf.co and it’s up to 2 million+ pages. However, one of my aims was to only index English content. I thought that by searching for the HTML “lang” attribute, it would be really easy to do! Even if there were a few different versions like en-gb or en-us, I thought it would still be an easy task. I also thought that all popular/mainstream/big sites would use this tag!

So 2 million English pages later and well…. I have over 300 different lang=”*en*” variations! Plus major sites like Wikipedia don’t even use the tag!!
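In hindsight, most of those 300 variations collapse if you only look at the primary language subtag. A quick sketch (Python, purely for illustration; this is not the code Burf.co uses):

```python
def is_english_lang(lang_attr):
    """Treat any lang attribute whose primary subtag is "en" as English.

    Collapses variants like en, en-gb, en-US and en_au into one bucket.
    Pages with no lang attribute return False and need content-based
    detection instead (see below).
    """
    if not lang_attr:
        return False
    # Normalise case and separators, then keep only the primary subtag
    return lang_attr.strip().lower().replace("_", "-").split("-")[0] == "en"

print(is_english_lang("en-GB"))   # True
print(is_english_lang("en_us"))   # True
print(is_english_lang("fr"))      # False
print(is_english_lang(None))      # False
```

Of course, this still does nothing for sites like Wikipedia that skip the attribute entirely.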

I guess it’s back to the drawing board!! I now need to use some sort of word-matching algorithm that looks at the page content and works out if it’s English! A simplistic way of doing this could be to search for very common English words (“the”, “a”, “at”, “there”, etc.) and see if they exist or not.
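That common-word idea can be sketched like this (Python for illustration; the word list and the 5% threshold are my own made-up numbers, not anything Burf.co has settled on):

```python
# Very common English function words; their density is a crude language signal.
COMMON_WORDS = {"the", "a", "at", "there", "and", "of", "to", "in", "is", "it"}

def looks_english(text, threshold=0.05):
    """Guess whether text is English by the share of common English words."""
    words = text.lower().split()
    if not words:
        return False
    hits = sum(1 for w in words if w in COMMON_WORDS)
    return hits / len(words) >= threshold

print(looks_english("The cat sat on the mat and looked at the door"))  # True
print(looks_english("Le chat est assis sur le tapis"))                 # False
```

Crude, but function words are so frequent in real English prose that even a tiny list catches most pages; a proper n-gram language detector would be the next step up.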

Update coming soon 🙂

Burf.co is up!!! But….

So the good news is that the Burf.co search engine is up and running in a rather alpha stage; the bad news is that it’s full of adult material, so please be warned!!! There is no safe search filter on it yet!!!

My plan is to improve the search algorithm and then put an adult content blocker in place to protect people! This of course can be turned off, but trust me, there is some nasty stuff out there that no one wants to see!

Technology wise, the index is just over a million pages and will heavily increase when the new blade server arrives tomorrow. It’s using MongoDB full-text search for the initial results. MongoDB did seem to sh@t the bed when indexing all the words on a page, so I now only index parts of each page.
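The “only parts of each page” workaround amounts to capping how much text goes into the full-text index per document. Roughly (a Python sketch; the helper and the 500-word cap are my own illustrative choices, not Burf.co’s actual numbers):

```python
def truncate_for_index(text, max_words=500):
    """Keep only the first max_words words of a page before indexing.

    A crude way to cap the size of the full-text index per document,
    trading recall on very long pages for insert throughput.
    """
    words = text.split()
    return " ".join(words[:max_words])

page = "word " * 1000
print(len(truncate_for_index(page).split()))  # 500
```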

More to come soon 🙂

This week’s update: Bye Bye MongoDB

So it is exciting times! I have made some progress with TRTLExchange; however, due to things outside of my control it has been slower than expected.  So I have turned my spare time to Burf.co, my new search engine project, and while there is no website for it yet (there will be by the weekend), the actual search technology (code) has come along leaps and bounds.  Overnight it managed to index over 500,000 pages, which for a single server was pretty cool.  It did get up to 1.3 million pages, but MongoDB has, erm, shit the bed (many, many times).  This could be a hardware limit (hard drive speed) or some performance tuning I need to do; however, it gets to the point where I can’t even insert more records without timeouts.  This concerns me quite a bit as I have an HP blade server on its way to up the crawling rate by a factor of 8.  I am going to try and give it one last go today; however, it’s taken 12 hours to delete the data from the DB (I did remove instead of drop 🙁 ).  It has been a very interesting learning curve learning MongoDB.  I think unless some magic happens I am going to try out Postgres next.

On the Swift front I did start building the frontend for Burf, first I was going to do this in VueJS, however, I have now found that Swift’s server-side framework Perfect supports templating via Mustache.  I think I will make faster progress writing it all in Swift than switching back and forth.   I still want to continue learning VueJS on the side (used for the TRTLExchange) as Javascript is such a good thing to know nowadays.

Writing this blog post has also just reminded me that I was trying to learn Kotlin about a month ago (facepalm).  Damn!

 

Experimenting with MongoKitten

As mentioned in my previous post, I have started looking at Server Side Swift with the aim to build a search engine (Burf.co).  To store my crawled data I decided to try and use MongoDB as it supports full-text search out of the box.  The original Burf.com used Equinox (made by Compsoft) and then later used Microsoft Indexing Service.  This time round I wanted to be a little more scalable.  Now there are probably better DB solutions for what I plan to do, but MongoDB seemed really simple to get up and running with.  Later on, I should be able to switch out the database layer if needed.

MongoKitten

Now that I had decided to use Swift and MongoDB, I needed to find a framework that connects them; my friend (who knows his stuff) recommended MongoKitten!  I got up and running with it fairly quickly even though I don’t know MongoDB too well. Life was good; however, there were a few things I did struggle with:

Contains

So, searching a field for a partial string requires you to use a regex (it seems).

Mongo:

db.users.findOne({"username": {$regex: ".*eBay.*"}});

MongoKitten:

let query: Query = [
    "url": RegularExpression(pattern: ".*\(eBay).*")
]

let matchingEntities: CollectionSlice<Document> = try pages.find(query)

Sorting results on $meta textScore

MongoDB allows you to set up full-text searching across your data; it can be across an entire record, or just certain fields (name, address, etc.).  When you perform a full-text search, MongoDB returns the relevant records with an accuracy score ($meta textScore).  MongoDB lets you change how it creates these scores by allowing you to adjust the weight each field receives, e.g. name is more important than address.

Mongo:

db.pages.find({$text: {$search: "ebay"}}, {score: {$meta: "textScore"}}).sort({score: {$meta: "textScore"}})

MongoKitten:

let query: Query = [
    "$text": ["$search": str],
    "lang": ["$eq": "en"],
]

let projection: Projection = [
    "_id": .excluded,
    "url": "url",
    "title": "title",
    "score": ["$meta": "textScore"]
]

let sort: Sort = [
    "score": .custom([
        "$meta": "textScore"
    ])
]

let matchingEntities: CollectionSlice<Document> = try pages.find(query, sortedBy: sort, projecting: projection, readConcern: nil, collation: nil, skipping: 0, limitedTo: Settings.searchResultLimit)

Getting Help

I found the best way to get help was to contact the creator of MongoKitten (Joannis) via Slack; he is pretty busy but super helpful!

This week’s update: Server-side Swift and updates to the site!

So, in terms of coding, the last week has been rather busy, and in terms of progress, it’s looking good on all fronts (except Hack24)!

First off, there have been some updates to the website to finally include a basic list of historic Robotics projects.  I had planned to do that last year!  I have also updated the design and content a little, which frankly no one will notice!

Next, I started yet another new project: Burf Search Engine (The Return).  It had been on the cards for a while; however, I have now started coding it.  First in NodeJS, then in server-side Swift using the Perfect framework.  It went swimmingly well (thanks to Ad) until I had to deploy it to Ubuntu.  There, you start to learn the differences between the Apple frameworks and the open-source versions of the same frameworks. Naively, I was expecting the Apple frameworks to have fewer //todo and //to implement comments 🙂

The crypto project didn’t progress as far as I hoped this week, purely due to my friend (and fellow coder) being ill; however, momentum has now picked back up.  As you’re also probably aware, cryptocurrencies took a massive dive this week!

No update on Hack24, it will resume after phase 1 of the Crypto project has been completed and the search engine crawler for Burf.co is live 🙂

On a side note, I’m really enjoying Altered Carbon on Netflix, and thanks to Ford for sending me this epic LEGO set 75885 (Ford Fiesta M-Sport WRC) to review at the weekend.

 

 

Humble Bundle: Mobile Development Bundle

So a new Humble Bundle has appeared that’s aimed at mobile devs, which is cool!  It contains a large range of books by Packt that will help any mobile dev improve their skills; details below:

https://www.humblebundle.com/books/mobile-app-development-books

I am sure I have mentioned it before, but I would also like to point out that Packt gives away a free book every day!!!  Claiming it is one of my daily tasks!

https://www.packtpub.com/packt/offers/free-learning

 

Where is Hack24?

So Hack24 is becoming a bit more delayed than hoped; however, this is for good reasons (well, I think so).

I have been introduced to cryptocurrencies, via my friend sending me a link to https://turtlecoin.lol/ and telling me to mine the sh*t out of it!

Knowing a little (and I mean a little) about BitCoin (and its huge growth in value), I thought it was a great opportunity to find out more about this technology without investing any money.

Since I started mining it about 2 weeks ago, I have watched several documentaries with my wife about digital currencies plus the actual TRTL coin has gone up 100x in value!

If nothing else it’s been a fun learning experience and I feel a bit more comfortable around the subject.  I am now working on a very short-term project in this space 🙂

Hack24 will resume very shortly 🙂