UnQLite Users Forum

Some suggestion for storing a lot of simple data

LeMoussel

I'm looking to crawl a large number of web pages (about 1,000,000 records) and store their linking structure for later use. This is how I planned to lay out the database:

`*Table: WebPages*`

ID    URL
----  -----------------------------------
1     http://www.site1.com/page.php
2     http://www.site2.com/page-abc.php
3     http://www.site3.com/page-1.php
4     http://www.site4.com/page-cd.php
5     http://www.site5.com/page-nice.php
6     http://www.site6.com/page-some.php
7     http://www.site7.com/page-hrmm.php
8     http://www.site8.com/page-stack.php
9     http://www.site9.com/page-ex.php
10    http://www.site10.com/page-dba.php

`*Table: Links*`

SourceId  TargetId
--------  --------
2         1
3         2
4         8
5         1
6         3
7         4
8         5
8         9
9         7
10        6
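
For the SQLite side, the two tables above could be sketched roughly as follows, using Python's standard `sqlite3` module. The column types, the composite primary key, and the index name `idx_links_target` are assumptions added for illustration; only the table and column names come from the post.

```python
import sqlite3

# In-memory database for the sketch; a real crawl would use a file path.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE WebPages (
    ID  INTEGER PRIMARY KEY,
    URL TEXT NOT NULL UNIQUE
);
CREATE TABLE Links (
    SourceId INTEGER NOT NULL REFERENCES WebPages(ID),
    TargetId INTEGER NOT NULL REFERENCES WebPages(ID),
    PRIMARY KEY (SourceId, TargetId)   -- one row per distinct link
);
-- Assumed extra index so "who links to page X" does not scan the table.
CREATE INDEX idx_links_target ON Links(TargetId);
""")

# Sample rows copied from the tables in the post.
names = ["page", "page-abc", "page-1", "page-cd", "page-nice",
         "page-some", "page-hrmm", "page-stack", "page-ex", "page-dba"]
pages = [(i, f"http://www.site{i}.com/{name}.php")
         for i, name in enumerate(names, start=1)]
links = [(2, 1), (3, 2), (4, 8), (5, 1), (6, 3),
         (7, 4), (8, 5), (8, 9), (9, 7), (10, 6)]

with conn:  # commit both inserts as one transaction
    conn.executemany("INSERT INTO WebPages (ID, URL) VALUES (?, ?)", pages)
    conn.executemany("INSERT INTO Links (SourceId, TargetId) VALUES (?, ?)", links)

# Which pages link to page 1?
inbound = [row[0] for row in conn.execute(
    "SELECT SourceId FROM Links WHERE TargetId = ? ORDER BY SourceId", (1,))]
link_count = conn.execute("SELECT COUNT(*) FROM Links").fetchone()[0]
```

Batching inserts inside a single transaction, as the `with conn:` block does, is usually what matters most for bulk-load speed in SQLite.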

Basically, this will let me see which web pages link where, recursively and several levels deep, for each website. I want to map a large network of websites and their linking patterns.
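
Following links several levels deep amounts to a breadth-first walk over the link table. A minimal sketch in plain Python, independent of the storage engine; the function names `build_adjacency` and `reachable_within` are made up for this example, and the link pairs are the sample rows from the Links table:

```python
from collections import defaultdict, deque

# (SourceId, TargetId) pairs from the sample Links table.
LINKS = [(2, 1), (3, 2), (4, 8), (5, 1), (6, 3),
         (7, 4), (8, 5), (8, 9), (9, 7), (10, 6)]

def build_adjacency(links):
    """Group outgoing link targets by source page ID."""
    adjacency = defaultdict(list)
    for source, target in links:
        adjacency[source].append(target)
    return adjacency

def reachable_within(adjacency, start, max_depth):
    """Return {page_id: depth} for pages reachable from `start`
    by following at most `max_depth` links (start itself excluded)."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depths[page] == max_depth:
            continue  # do not expand beyond the requested depth
        for target in adjacency.get(page, ()):
            if target not in depths:  # skip pages already visited (cycles)
                depths[target] = depths[page] + 1
                queue.append(target)
    del depths[start]
    return depths

adjacency = build_adjacency(LINKS)
two_levels = reachable_within(adjacency, 8, 2)
# two_levels == {5: 1, 9: 1, 1: 2, 7: 2}
```

The visited check matters here because the sample data contains a cycle (8 links to 9, 9 to 7, 7 to 4, and 4 back to 8).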

So I'd like to know whether there is a better way of doing this with UnQLite, and I'd welcome suggestions on how to design the database structure.
I was planning to start with SQLite since I've used it a bit, but with this amount of data I'm open to anything.


#1. flanhard

I recommend experimenting with both UnQLite and SQLite on a few hundred links first; whichever performs better is the right choice for this kind of job. Key/value stores generally outperform relational databases in this area.
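
For the key/value side of such an experiment, one possible layout is a `url:<id>` record per page and an `out:<id>` record holding all outgoing link targets of a page as packed 32-bit integers. This is only a sketch: the key prefixes and helper names are invented here, and a plain Python `dict` stands in for the store. With UnQLite itself, the writes and reads would go through its key/value calls (`unqlite_kv_store` and `unqlite_kv_fetch` in the C API).

```python
import struct

def encode_targets(targets):
    """Pack page IDs as little-endian unsigned 32-bit integers."""
    return struct.pack(f"<{len(targets)}I", *targets)

def decode_targets(blob):
    """Inverse of encode_targets."""
    return list(struct.unpack(f"<{len(blob) // 4}I", blob))

def put_page(store, page_id, url):
    # Hypothetical key scheme: b"url:<id>" -> UTF-8 URL
    store[b"url:%d" % page_id] = url.encode("utf-8")

def put_links(store, source_id, targets):
    # Hypothetical key scheme: b"out:<id>" -> packed, sorted target IDs
    store[b"out:%d" % source_id] = encode_targets(sorted(targets))

def get_links(store, source_id):
    """Return the outgoing targets of a page, or [] if none are stored."""
    return decode_targets(store.get(b"out:%d" % source_id, b""))

store = {}  # stand-in for the real key/value store (bytes -> bytes)
put_page(store, 8, "http://www.site8.com/page-stack.php")
put_links(store, 8, [9, 5])
targets_of_8 = get_links(store, 8)
```

Storing the whole adjacency list under one key means a single lookup per page when walking the graph, at the cost of rewriting the record whenever a page gains a link.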

#2. roykfahey

To reduce space requirements you might consider a few things:
Not bothering to store http:// or the leading www. - this is just wasted space (though in a few cases the www. is required because people don't know how to configure their web sites properly).
Making sure you use data compression. Most systems are still I/O-bound, not CPU-bound.
Only storing each domain name once, and storing the page paths separately. Both can repeat, and storing them multiple times is wasteful.
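
These three suggestions could look roughly like this in Python, using only the standard library. The names `split_url` and `DomainTable` are invented for the sketch, and dropping `www.` is lossy for the few sites that require it, as noted above.

```python
import zlib
from urllib.parse import urlsplit

def split_url(url):
    """Return (host, rest): scheme dropped, leading 'www.' stripped."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    rest = parts.path or "/"
    if parts.query:
        rest += "?" + parts.query
    return host, rest

class DomainTable:
    """Assign each distinct host a small integer ID, stored only once."""
    def __init__(self):
        self.ids = {}

    def intern(self, host):
        return self.ids.setdefault(host, len(self.ids) + 1)

domains = DomainTable()
records = []
for url in ("http://www.site1.com/page.php",
            "http://www.site1.com/other.php",
            "http://www.site2.com/page-abc.php"):
    host, rest = split_url(url)
    records.append((domains.intern(host), rest))

# Compression pays off on larger blocks; tiny individual values may even grow,
# so this packs many paths together before compressing.
blob = "\n".join(rest for _, rest in records).encode("utf-8")
packed = zlib.compress(blob)
restored = zlib.decompress(packed)
```

Here both `site1.com` pages share domain ID 1, so the host string is stored once and each page record carries only a small integer plus its path.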

