I will no longer use Go on this project. It turns out that I am not comfortable with the way Go libraries are created and used, and with all of the path-setting wahala.
I am not saying Go is a horrible language – I am not in any position to say such a thing. I feel like the language has little support from the developer community.
It also needs better IDE support, as my favorite editor (JetBrains' IntelliJ) doesn't come with support for it unless a plugin is installed. IMO, the Go plugin for IntelliJ needs more work on GOPATH handling and code refactoring.
Or maybe I am not a good programmer. Maybe I still need more time to understand and be comfortable with Go. Well, I just have to move on with what I know for now.
Aside from everything else, I love the simplicity of Go and the fact that concurrency is a first-class feature of the language.
I will now fall back to my beautiful Scala. All existing Go code will be rewritten in Scala.
I was looking for a robots.txt parser written in Go and found some on GitHub. I didn't like or understand them, so I wrote one myself. You can find Robotty on GitHub.
Now I can continue with the rest of the Crawler 🙂
The URL Manager is the component that is responsible for serving sources to the crawlers. The manager serves sources from the seed store and from a custom source collection or database.
A store is any website that acts as an online shop and sells products of different kinds. Konga and OLX are examples of stores.
A seed is a specially selected store required to start up the system. I am presently working with Konga, Jumia and OLX.
A seed store is a directory that contains map files for seeds. A map file is a JSON file that contains an accurate description of how a source can be efficiently crawled, parsed and indexed.
Map files are named in this format – [store_name].json, where store_name is the name of the store, e.g. konga.json. In coming posts, I will describe the properties of the map file.
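To make that concrete, here is a sketch of what a map file could look like. Every field name below is hypothetical; the actual schema will be described in the coming posts.

```json
{
  "name": "konga",
  "base_url": "https://www.konga.com",
  "product_url_pattern": "/product/.*",
  "selectors": {
    "title": "h1.product-title",
    "price": "span.price"
  }
}
```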
The manager distributes seeds by pushing them onto a Redis list using the LPUSH command, while the crawlers use the RPOP command to get jobs.
The manager works in cycles and tries to re-enqueue the seeds onto the queue. A cycle starts every 6 hours by default, so sources are checked for updates every 6 hours. Only sources that have not been visited in the last 6 hours will be requeued.
The sources are first converted from objects to JSON strings and then to Base64 strings. The receiving crawler can recreate a source object from this string. If encryption is required, it can be added quite easily.
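That object-to-JSON-to-Base64 round trip can be sketched with the Go standard library. The Source struct and its fields here are hypothetical placeholders for the real source object:

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// Source is a hypothetical shape for a seed entry; the real
// fields live in the map files described earlier.
type Source struct {
	Name    string `json:"name"`
	BaseURL string `json:"base_url"`
}

// encodeSource produces the Base64-of-JSON string that gets
// pushed onto the Redis list.
func encodeSource(s Source) (string, error) {
	raw, err := json.Marshal(s)
	if err != nil {
		return "", err
	}
	return base64.StdEncoding.EncodeToString(raw), nil
}

// decodeSource is what a crawler does after popping a job.
func decodeSource(payload string) (Source, error) {
	raw, err := base64.StdEncoding.DecodeString(payload)
	if err != nil {
		return Source{}, err
	}
	var s Source
	err = json.Unmarshal(raw, &s)
	return s, err
}

func main() {
	seed := Source{Name: "konga", BaseURL: "https://www.konga.com"}
	payload, _ := encodeSource(seed)
	got, _ := decodeSource(payload)
	fmt.Println(got.Name, got.BaseURL) // konga https://www.konga.com
}
```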
Next up is the crawler – I hope to have a working version by tomorrow and would make a post describing it.
The system is composed of the following components:
1. Seed Store
2. URL Manager
3. Spiders / Crawlers
4. HTML Store
5. Parsers
6. Search Server
Seed Store
The seed store holds the initial websites to index. All information required to efficiently traverse a website and extract relevant information is provided in a map.json file for each source.
URL Manager
This component is responsible for the distribution of sources/work to crawlers. A source already assigned to a crawler will not be reassigned to another crawler, for politeness reasons.
What I am trying to do is ensure this part of the system is independent of everything else so that it can function on its own.
I am going to use a queuing system like Redis or RabbitMQ to distribute work among components hosted on different machines.
Crawler / Spider
Each crawler receives a source from the queue and uses the map file to crawl the source appropriately. Only one connection is maintained per source or domain, although the ability to increase the number of connections per source will be supported. Robots.txt rules are also obeyed.
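A minimal sketch of the one-connection-per-domain rule, assuming a simple semaphore per domain (this is illustrative, not the crawler's actual code):

```go
package main

import (
	"fmt"
	"sync"
)

// domainLimiter enforces the politeness rule: at most max concurrent
// connections per domain (one by default, raisable per source later).
type domainLimiter struct {
	mu    sync.Mutex
	slots map[string]chan struct{}
	max   int
}

func newDomainLimiter(max int) *domainLimiter {
	return &domainLimiter{slots: make(map[string]chan struct{}), max: max}
}

// sem lazily creates the buffered channel acting as the semaphore
// for a domain.
func (l *domainLimiter) sem(domain string) chan struct{} {
	l.mu.Lock()
	defer l.mu.Unlock()
	s, ok := l.slots[domain]
	if !ok {
		s = make(chan struct{}, l.max)
		l.slots[domain] = s
	}
	return s
}

// Acquire blocks until a connection slot for domain is free.
func (l *domainLimiter) Acquire(domain string) { l.sem(domain) <- struct{}{} }

// Release frees a slot after the request completes.
func (l *domainLimiter) Release(domain string) { <-l.sem(domain) }

func main() {
	l := newDomainLimiter(1)
	l.Acquire("konga.com")
	// A second Acquire("konga.com") would block here until Release.
	l.Release("konga.com")
	fmt.Println("ok")
}
```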
HTML Store
This holds the HTML pages downloaded by the crawlers. The plan is to use a Redis server to store the data. Parsers will query this database to process the pages.
Parsers
These are a bunch of workers querying the HTML Store for pages to process. They are responsible for making sense of the pages, extracting information with the help of a map file. Extracted information is indexed immediately.
Search Server
This is an Elasticsearch stack responsible for analyzing, indexing and searching the crawled data. This is the component that clients interact with.
Today, I made a decision to build a better, improved version of the app that helped me win an Aptech programming competition.
Shoplog is a search engine that indexes products that have been put up for sale on shopping or classified ads websites.
The utility will make it easy for shoppers to search and compare products from multiple online stores.
The system should be able to:
1. Automatically discover products
2. Efficiently parse and make sense of the product information
3. Index and search quickly
4. Support third-party integration
The simple goal is to be able to enter a product name and/or features and get a bunch of results from different stores.
I plan to develop this in just a month and will be making use of Google Cloud compute infrastructure. The system will be built mostly with Go and Scala.
The plan is to talk about everything up to completion. In the next post, I will talk about the system design.
It’s been a while since my last post. A lot has changed since then – I now work for Save & Buy @ the CCHub. It’s been fun. I work with cool and smart people. I should also add that I am the lead developer there 😂 😂
I am almost done with Aptech too – I have a couple of days left. I also have a babe now 😀 (she is not my first o. I'm not a nerd like that). At this rate, I may just get married in two years (I rebuke it :D). Why is everything moving too fast :(:(
Anyways, I hope to blog more about my state of mind and activities. This is probably the only way I can remain sane, as not many people around me can understand or have the patience to want to understand an ambitious programmer's hustle.
Before I end this post, let me share a totally useless quote:
The world is round, gravity is invisible
The year seems to be looking good for me. I feel like I'm growing — finally. I'm not too sure where I'm headed, though. I'm not sure if I'm going to own a startup or work for one. But I see changes.
My last post about a payment gateway powered by recharge cards was featured on TechCabal, which is one of the top tech blogs in the country. As a result, people mailed me. I got some attention (even though I wasn't really looking for any – who am I kidding, I loved it). I got job offers and collaboration requests, some of which I'm still considering.
I also completed my first client job: a perfume shopping website. I never wanted to build websites/apps for anyone. I am not a fan of the client-developer relationship. But I was losing my head. I needed to do something to prove that I can code. It sucks when people hail you and you've become some sort of local celebrity in your street/school/circle but nobody has really seen what you've done.
To further make it clear that I can really code, I took part in my "school's" anniversary coding competition, where I won the first prize (well, I still haven't received the first prize. Damn! Indians 😦). I developed a search engine that does something (I can't say yet. I might further develop it into a startup 😀) in 2 days. I coded like hell. I did not go to school for those days. I did not do my usual fancy system design sketches on paper. I just knew what was needed (crawler, parser, indexer, search server, client server) and built it. It wasn't the first time I was doing something like this, though. Of course there were bugs… I'm not a coding god just yet.
So now that I've proven to myself and everyone else that I can code, I want to revisit this project. I want to see if I'm motivated enough to get a beta version out. I don't know if I can, but it would go a long way in building my self-confidence if I could, because if you ask me, I cannot categorically tell you that I can. Not because I cannot code, but because I just can't get things done if I'm not under some sort of pressure.
In my next post, I’m going to talk about my next experiment/prototype — The social music player for the internet.