Populating a Database using a Large Dataset

tansey's Avatar

tansey

12 Feb, 2011 10:08 PM

I have a dictionary of about 50,000 words; each word has-many synonyms, antonyms, etc. I'm using Fluent NHibernate and I've created a 10gb MS SQL Server instance for the app and I am trying to populate it as a batch update:

List<Word> words;
...
using (IStatelessSession session = session.SessionFactory.OpenStatelessSession())
    using (var transaction = session.BeginTransaction())
        {
            foreach (var word in words)
                   session.Insert(word);
            transaction.Commit();
        }

I've set the batch size to 1000:

 private static ISessionFactory CreateSessionFactory()
 {
         return Fluently.Configure()
                //.Database(SQLiteConfiguration.Standard.InMemory())
                 .Database(MsSqlConfiguration.MsSql2008.ConnectionString(connStr).AdoNetBatchSize(1000))
                .Mappings(M => M.FluentMappings.AddFromAssemblyOf<WordMapping>())
                .ExposeConfiguration(Cfg => _configuration = Cfg)
                .BuildSessionFactory();
 }

It has been running for hours with no end in sight. Is this the best way to populate my database or is there a more expedient approach when dealing with large datasets?

Edit: Word model and mapping are below.

public class Word
{
     public virtual int Id { get; set; }
     public virtual string Text { get; set; }
     public virtual IList<Word> Synonyms { get; set; }
     public virtual IList<Word> Antonyms { get; set; }
}

public class WordMapping : ClassMap<Word>
{
    public WordMapping()
    {
        Id(x => x.Id).UnsavedValue(0);
        Map(x => x.Text);

        HasMany(x => x.Synonyms).Cascade.AllDeleteOrphan();
        HasMany(x => x.Antonyms).Cascade.AllDeleteOrphan();
    }
}

Edit 2: I can't seem to finish populating it at all. Everytime I run, I get a SQL Timeout Exception at some point.

  1. Support Staff 1 Posted by friism on 13 Feb, 2011 07:36 PM

    friism's Avatar

    I'm assuming you're running the above code from your local machine. You could probably speed up the process by running the data-population code from a machine with low latency to our database servers. You could, for example, boot up a an EC2 instance in the us-east region and run your code there.

    We'd love to learn how you would like to move data to your AppHarbor database. Right now we are looking at letting you upload SQL Server backup-files and having us restore them, or at providing a script that makes it easy to use to use SQL Server Bulk Copy (bcp).

  2. 2 Posted by tansey on 13 Feb, 2011 08:34 PM

    tansey's Avatar

    Thanks Michael. I looked into bcp a little, but I thought it can only populate one table? Since I have multiple tables, I didn't think I could use it.

    I've never used EC2 before. Any info on how to go about starting up an instance?

  3. Support Staff 3 Posted by friism on 13 Feb, 2011 08:49 PM

    friism's Avatar

    I'll try and do a PowerShell script that bcp's all tables in a database in the afternoon (although I have to think about referential constraints).

    For EC2, you have to register for Amazon Web Services. Once you're in the EC2 console, you can boot up Windows instance. It'll probably take you around an half-an-hour the first time you try it. For the data-population you need to do, a 'small' instance should do the job. Remember to shut down the instance once you're done (you pay by the hour).

  4. 4 Posted by tansey on 13 Feb, 2011 10:56 PM

    tansey's Avatar

    Great! For the shell script, would I just need to create a database locally and run the script on it?

  5. Support Staff 5 Posted by friism on 23 Feb, 2011 04:11 PM

    friism's Avatar

    Hi Wesley, really sorry about the wait.

    I have a console app for you, get it on Github and check out the explanatory blog post. Build the project and run the console app. You can see the required parameters in the source -- they should be self-explanatory.

    You need to make sure that source and destination database schemas match (you can generate database creation script by right-clicking source database -> tasks -> Generate scripts).

    Again, sorry about the wait and let me know if you hit any snags.

  6. friism closed this discussion on 23 Feb, 2011 04:11 PM.

Discussions are closed to public comments.
If you need help with AppHarbor please start a new discussion.

Keyboard shortcuts

Generic

? Show this help
ESC Blurs the current field

Comment Form

r Focus the comment reply box
^ + ↩ Submit the comment

You can use Command ⌘ instead of Control ^ on Mac