Some suggestions for storing a lot of simple data

LeMoussel

I'm looking to crawl a lot of webpages (about 1,000,000 records) and store their linking structure for later analysis. This is how I planned to lay out the database:

`*Table: WebPages*`

ID    URL
----  -----------------------------------
1     http://www.site1.com/page.php
2     http://www.site2.com/page-abc.php
3     http://www.site3.com/page-1.php
4     http://www.site4.com/page-cd.php
5     http://www.site5.com/page-nice.php
6     http://www.site6.com/page-some.php
7     http://www.site7.com/page-hrmm.php
8     http://www.site8.com/page-stack.php
9     http://www.site9.com/page-ex.php
10    http://www.site10.com/page-dba.php

`*Table: Links*`

SourceId  TargetId
--------  --------
2         1
3         2
4         8
5         1
6         3
7         4
8         5
8         9
9         7
10        6

Basically, I'll be able to see which webpages link where, recursively and several levels deep, for each website. I want to map a large network of websites and their linking patterns.
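For illustration, here is a minimal sketch of the two tables above in SQLite, using Python's built-in `sqlite3` module, with a recursive query that follows links a few levels deep. The function names `build_sample_db` and `reachable`, the index name, and the depth limit are assumptions made for this example, and it needs a SQLite build with recursive CTE support (3.8.3 or later). The depth limit also keeps the walk from looping forever on cycles such as 8 -> 9 -> 7 -> 4 -> 8 in the sample data.

```python
import sqlite3

def build_sample_db():
    """Create an in-memory database holding the sample WebPages and Links rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE WebPages (
            ID  INTEGER PRIMARY KEY,
            URL TEXT NOT NULL UNIQUE
        );
        CREATE TABLE Links (
            SourceId INTEGER NOT NULL REFERENCES WebPages(ID),
            TargetId INTEGER NOT NULL REFERENCES WebPages(ID),
            PRIMARY KEY (SourceId, TargetId)
        );
        -- Hypothetical index name; speeds up "who links to this page" queries.
        CREATE INDEX idx_links_target ON Links(TargetId);
    """)
    names = ["page", "page-abc", "page-1", "page-cd", "page-nice",
             "page-some", "page-hrmm", "page-stack", "page-ex", "page-dba"]
    pages = [(i, f"http://www.site{i}.com/{name}.php")
             for i, name in enumerate(names, start=1)]
    links = [(2, 1), (3, 2), (4, 8), (5, 1), (6, 3),
             (7, 4), (8, 5), (8, 9), (9, 7), (10, 6)]
    with conn:  # one transaction for all inserts
        conn.executemany("INSERT INTO WebPages VALUES (?, ?)", pages)
        conn.executemany("INSERT INTO Links VALUES (?, ?)", links)
    return conn

def reachable(conn, start_id, max_depth):
    """Return (page_id, hops) for pages reachable from start_id within max_depth links."""
    return conn.execute("""
        WITH RECURSIVE walk(id, depth) AS (
            SELECT ?, 0
            UNION
            SELECT l.TargetId, w.depth + 1
            FROM Links AS l
            JOIN walk AS w ON l.SourceId = w.id
            WHERE w.depth < ?
        )
        SELECT id, MIN(depth) AS hops
        FROM walk
        WHERE depth > 0
        GROUP BY id
        ORDER BY hops, id
    """, (start_id, max_depth)).fetchall()

if __name__ == "__main__":
    conn = build_sample_db()
    print(reachable(conn, 10, 3))
```

With a million pages, the composite primary key on `Links` covers lookups by source and the extra index covers lookups by target; whether that is enough in practice would need measuring on real data.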

So I'd like to know whether there is a better way of doing this with UnQLite, and I'd welcome suggestions on how to design the database structure.
I was planning on SQLite to start with, since I've used it a bit, but with this amount of data I'm open to anything.

#1. flanhard

I recommend starting by experimenting with both UnQLite and SQLite on a couple of hundred links; whichever performs better is the right choice for this kind of work. Key/value stores usually outperform relational databases in this area.
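A small timing harness is one way to run that kind of experiment. The sketch below only covers the SQLite side, using Python's built-in `sqlite3` module on synthetic data; the helper names (`make_links`, `time_it`, `sqlite_insert`, `sqlite_lookup`) are made up for this example. The UnQLite side is not shown: it would be a pair of equivalent insert and lookup functions written against whichever UnQLite binding you use, passed to the same `time_it` helper.

```python
import sqlite3
import time

def make_links(n):
    """Generate n synthetic (source, target) pairs; purely illustrative data."""
    return [(i, (i * 7 + 1) % n) for i in range(n)]

def time_it(label, fn):
    """Run fn once, print the elapsed wall-clock time, return (result, seconds)."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.4f}s")
    return result, elapsed

def sqlite_insert(links):
    """Bulk-insert all links in a single transaction."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Links (SourceId INTEGER, TargetId INTEGER)")
    conn.execute("CREATE INDEX idx_links_source ON Links(SourceId)")
    with conn:
        conn.executemany("INSERT INTO Links VALUES (?, ?)", links)
    return conn

def sqlite_lookup(conn, source_ids):
    """Fetch the outgoing links of each source page, one query per page."""
    query = "SELECT TargetId FROM Links WHERE SourceId = ?"
    return [conn.execute(query, (s,)).fetchall() for s in source_ids]

if __name__ == "__main__":
    links = make_links(200)
    conn, _ = time_it("sqlite insert", lambda: sqlite_insert(links))
    time_it("sqlite lookup", lambda: sqlite_lookup(conn, range(200)))
```

A couple of hundred in-memory rows will mostly measure overhead, so it is worth repeating the run on an on-disk database with a data set closer to the real size before drawing conclusions.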

#2. roykfahey

To reduce space requirements, you might consider a few things:
- Don't bother storing http:// or the leading www. since this is just wasted space (though in a few cases the www. is required because people don't know how to configure their websites properly).
- Make sure you use data compression. Most systems are still I/O-bound, not CPU-bound.
- Store each domain name only once, and store the page URLs separately. Both can repeat, and storing them multiple times is wasteful.
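The three ideas above can be sketched roughly as follows, using only the Python standard library. The names `split_url`, `UrlStore`, `pack_paths` and `unpack_paths` are invented for this example, and the in-memory dictionaries stand in for whatever tables or key/value records you actually use. Note that dropping the scheme also discards the http/https distinction, and dropping www. assumes both host names serve the same site, which the caveat above says is not always true.

```python
import zlib
from urllib.parse import urlsplit

def split_url(url):
    """Return (domain, path) with the scheme and a leading 'www.' dropped."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return host, path

class UrlStore:
    """Stores each domain once and each (domain, path) pair once."""

    def __init__(self):
        self.domain_ids = {}  # domain -> domain id
        self.page_ids = {}    # (domain id, path) -> page id

    def add(self, url):
        """Return the page id for url, assigning new ids only when needed."""
        domain, path = split_url(url)
        domain_id = self.domain_ids.setdefault(domain, len(self.domain_ids) + 1)
        return self.page_ids.setdefault((domain_id, path), len(self.page_ids) + 1)

def pack_paths(paths):
    """Compress a batch of paths together; single short strings compress poorly."""
    return zlib.compress("\n".join(paths).encode("utf-8"))

def unpack_paths(blob):
    """Inverse of pack_paths."""
    text = zlib.decompress(blob).decode("utf-8")
    return text.split("\n") if text else []

if __name__ == "__main__":
    store = UrlStore()
    print(store.add("http://www.site1.com/page.php"))
    print(store.add("http://site1.com/page.php"))  # same page id as the line above
    print(len(pack_paths([f"/page-{i}.php" for i in range(1000)])))
```

Compressing paths in batches rather than one by one is a design choice here: zlib needs some repeated context to save space, and individual URLs are usually too short for that.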
