Scraping in Java

  • RollDamnTide
    SBR Rookie
    • 09-23-13
    • 3

    #1
    Scraping in Java
    I'm far more proficient in Java than in anything else. Could someone point me in the right direction for scraping with it, or is Java just poorly suited to this?
  • Blax0r
    SBR Wise Guy
    • 10-13-10
    • 688

    #2
    I recommend using the Selenium library for scraping: http://www.seleniumhq.org/

    It's actually built for web page testing, but it can be repurposed to scrape data from static HTML as well as DOM content that JavaScript modifies.
    • Maverick22
      SBR Wise Guy
      • 04-10-10
      • 807

      #3
      If you want to start web scraping, the first thing you need to do is make sure your database is exactly how you want it.

      You don't want to start coding the scraper while you're still in the process of designing the database.

      After you build your database, you need to get a source code management solution going. Something like Subversion or Git...

      You might not understand the value of this... just trust me. DO IT. Integrate Subversion or Git with your IDE and always commit your code...

      You'd be really well served to have a server outside your home network do all the scraping, in case you get IP banned or something. Look into a server from a provider like digitalocean dot com. I run a server from there. Just get a $5 64-bit Linux server and keep it simple.

      This server can host your application, the source code control application, as well as your database. For 5 bucks a month, you are winning.

      As far as scraping goes, the first thing you need to do is find out where you want your data to come from. Most of the data you want will be inside HTML tables on some pages. So you will need to figure out how to convert that HTML table data into a data structure that you can manipulate and then store in the database.
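
A rough sketch of that table-to-data-structure step, using only the JDK's built-in XML parser. It assumes the table fragment is well-formed XML with lowercase tags and no HTML-only entities such as `&nbsp;`, which real pages often violate; a dedicated HTML parser such as jsoup would usually be the more forgiving choice. The class name `TableSketch`, the method `parseRows`, and the sample table are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TableSketch {
    // Turns a well-formed <table> fragment into a list of rows,
    // where each row is the list of trimmed <td>/<th> cell texts.
    // Assumption: the markup parses as XML (lowercase tags, all tags closed).
    static List<List<String>> parseRows(String tableMarkup) {
        Document doc;
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            doc = builder.parse(new InputSource(new StringReader(tableMarkup)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Markup is not well-formed XML", e);
        }

        List<List<String>> rows = new ArrayList<>();
        NodeList trs = doc.getElementsByTagName("tr");
        for (int i = 0; i < trs.getLength(); i++) {
            List<String> cells = new ArrayList<>();
            NodeList children = trs.item(i).getChildNodes();
            for (int j = 0; j < children.getLength(); j++) {
                Node child = children.item(j);
                if (child.getNodeType() != Node.ELEMENT_NODE) {
                    continue; // skip whitespace text nodes between cells
                }
                String tag = child.getNodeName();
                if (tag.equals("td") || tag.equals("th")) {
                    cells.add(child.getTextContent().trim());
                }
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample data, not taken from any real page.
        String sample = "<table>"
                + "<tr><th>Team</th><th>Line</th></tr>"
                + "<tr><td>Alabama</td><td>-7.5</td></tr>"
                + "</table>";
        for (List<String> row : parseRows(sample)) {
            System.out.println(row);
        }
    }
}
```

Each inner list maps naturally onto one database row, with the first row typically holding the column headers.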

      Also... you will need to learn regular expressions. And by learn... I mean you should know them well enough to be certified in regular expressions by the time you are done with your web scraper.
      • Maverick22
        SBR Wise Guy
        • 04-10-10
        • 807

        #4
        Hopefully this is good advice. I'm just communicating things I wish I'd known back when I started.
        • Blax0r
          SBR Wise Guy
          • 10-13-10
          • 688

          #5
          I definitely agree with Maverick's point about SVN; honestly, I think everyone should use versioning software for everything (not just code).

          That said, I believe Selenium is a cleaner solution than regexes. Either way, you'll have to re-code whenever the web page changes, so for static data it's really a matter of preference.
          • creditcardclown
            SBR High Roller
            • 11-28-10
            • 242

            #6
            Maverick, have you been IP banned before for scraping? Can you please explain the importance of source code management?

            Regex sucks for HTML. I use Python and lxml, and with Firefox's Page Inspector tool I can get any info from a page within a few minutes.
            • HUY
              SBR Sharp
              • 04-29-09
              • 253

              #7
              Originally posted by creditcardclown
              Regex sucks for HTML. I use Python and lxml, and with Firefox's Page Inspector tool I can get any info from a page within a few minutes.
              This is what I'm doing as well. Anyone parsing HTML with regex needs to get their head checked. Still, all programmers need to know regex anyway, just to handle the data once they get to it.
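
To illustrate that last point in Java, here is a minimal sketch of using a regex on text that has already been pulled out of the page, rather than on the HTML itself. The class name `CellCleaner`, the method `firstNumber`, and the sample strings are hypothetical; the real cell formats depend on the site being scraped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CellCleaner {
    // Optional sign, one or more digits, optional decimal part.
    private static final Pattern NUMBER =
            Pattern.compile("[+-]?\\d+(?:\\.\\d+)?");

    // Returns the first number found in an already-extracted cell text,
    // or null when the text contains no number at all.
    static Double firstNumber(String cellText) {
        Matcher m = NUMBER.matcher(cellText);
        return m.find() ? Double.valueOf(m.group()) : null;
    }

    public static void main(String[] args) {
        // Made-up cell contents for illustration only.
        System.out.println(firstNumber("  -7.5 pts "));
        System.out.println(firstNumber("O 48"));
        System.out.println(firstNumber("n/a"));
    }
}
```

Returning `null` for cells without a number leaves it to the caller to decide whether to skip the row or store a SQL NULL.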