Data sites and services

Kang Wang
Apr 14, 2022

Streamlining Your Search

While they may not always be easy to find, many databases on the web are indexed by search engines, whether the publisher intended this or not. Here are a few tips:

  • When searching for data, include both search terms relating to the content of the data you’re trying to find and some information on the format or source you would expect it to be in. Google and other search engines allow you to search by file type. For example, you can look only for spreadsheets (by adding “filetype:XLS OR filetype:CSV” to your search), geodata (“filetype:shp”), or database extracts (“filetype:MDB OR filetype:SQL OR filetype:DB”). If you’re so inclined, you can even look for PDFs (“filetype:pdf”).
  • You can also search by part of a URL. Googling for “inurl:downloads filetype:xls” will try to find all Excel files that have “downloads” in their web address (if you find a single download, it’s often worth just checking what other results exist for the same folder on the web server). You can also limit your search to only those results on a single domain name, by searching for “site:agency.gov”, for example.
  • Another popular trick is not to search for content directly, but for places where bulk data may be available. For example, “site:agency.gov Directory Listing” may give you some listings generated by the web server with easy access to raw files, while “site:agency.gov Database Download” will look for intentionally created listings.
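The search tips above can be combined mechanically. The following is a minimal sketch, assuming the search engine accepts the `filetype:`, `inurl:`, and `site:` operators and treats `OR` between adjacent operators as an alternative (Google does at the time of writing, but behaviour may differ elsewhere). The function names `build_query` and `search_url` are hypothetical helpers made up for this illustration.

```python
from urllib.parse import urlencode


def build_query(terms, filetypes=(), site=None, inurl=None):
    """Combine content terms with optional search operators.

    terms     -- words describing the content, e.g. "school budgets"
    filetypes -- file extensions to accept; several are joined with OR
    site      -- restrict results to one domain, e.g. "agency.gov"
    inurl     -- require this text in the result's web address
    """
    parts = [terms.strip()]
    if filetypes:
        # "filetype:xls OR filetype:csv" matches either format
        parts.append(" OR ".join(f"filetype:{ext.lower()}" for ext in filetypes))
    if inurl:
        parts.append(f"inurl:{inurl}")
    if site:
        parts.append(f"site:{site}")
    # Drop empty pieces (for example when no content terms were given)
    return " ".join(p for p in parts if p)


def search_url(query, base="https://www.google.com/search"):
    """Turn a query string into a URL that can be opened in a browser."""
    return f"{base}?{urlencode({'q': query})}"


if __name__ == "__main__":
    query = build_query("school budgets", ["xls", "csv"], site="agency.gov")
    print(query)
    print(search_url(query))
```

Keeping the query as a plain string makes it easy to paste into any search box; the URL helper is only a convenience for opening the search directly.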

Official data portals

A global index of such portals can be found at Data Catalogs. Another is the Guardian World Government Data, a meta search engine that includes many international government catalogues.

The Data Hub

A community-driven resource run by the Open Knowledge Foundation that makes it easy to find, share, and reuse openly available sources of data, especially in ways that are machine-automated.

QuickCode

An online tool to make the process of extracting “useful bits of data easier so they can be reused in other apps, or rummaged through by journalists and researchers.” Most of the scrapers and their databases are public and can be reused.

World Bank and United Nations data portals

These services provide high-level indicators for all countries, often for many years in the past.

Research data

There are numerous national and disciplinary aggregators of research data, such as the UK Data Archive. While a lot of this data is free at the point of access, much of it requires a subscription or cannot be reused or redistributed without asking permission first.

Ask a Forum

Search for existing answers or ask a question at Get The Data or Quora.

Ask a Mailing List

Mailing lists combine the wisdom of a whole community on a particular topic. For data journalists, the Data-Driven Journalism List and the NICAR-L lists are excellent starting points. Both of these lists are filled with data journalists and Computer-Assisted Reporting (CAR) geeks, who work on all kinds of projects. Chances are that someone may have done a story like yours, and may have an idea of where to start, if not a link to the data itself. You could also try Project Wombat (http://project-wombat.org/; “a discussion list for difficult reference questions”), the Open Knowledge Foundation’s many mailing lists, mailing lists at theInfo, or searching for mailing lists on the topic or in the region that you are interested in.

Join Hacks/Hackers

Hacks/Hackers is a rapidly expanding international grassroots journalism organization with dozens of chapters and thousands of members across four continents. Its mission is to create a network of journalists (“hacks”) and technologists (“hackers”) who rethink the future of news and information. With such a broad network, you stand a strong chance of someone knowing where to look for the thing you seek.

You’ve tried everything else, and you haven’t managed to get your hands on the data you want. You’ve found the data on the Web, but, alas, no download options are available and copy-paste has failed you. Fear not, there may still be a way to get the data out. For example, you can:

  • Get data from web-based APIs, such as interfaces provided by online databases and many modern web applications (including Twitter, Facebook, and many others). This is a fantastic way to access government or commercial data, as well as data from social media sites.
  • Extract data from PDFs. This is very difficult, as PDF is a language for printers and does not retain much information on the structure of the data that is displayed within a document. Extracting information from PDFs is beyond the scope of this post, but there are some tools and tutorials that may help you do it.
  • Screen scrape websites. During screen scraping, you’re extracting structured content from a normal web page with the help of a scraping utility or by writing a small piece of code. While this method is very powerful and can be used in many places, it requires a bit of understanding about how the web works.
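To make the screen scraping option above more concrete, here is a minimal sketch that pulls the rows out of an HTML table using only Python’s standard library. The `SAMPLE_HTML` snippet and its values are invented purely for illustration; in practice you would first download the real page, and its markup will almost certainly be messier. The names `TableScraper`, `scrape_table`, and `rows_to_csv` are hypothetical.

```python
import csv
import io
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Made-up example page; not real data.
SAMPLE_HTML = """
<table>
  <tr><th>Region</th><th>Schools</th></tr>
  <tr><td>North</td><td>12</td></tr>
  <tr><td>South</td><td>7</td></tr>
</table>
"""


def scrape_table(html):
    """Return the table rows found in an HTML string."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


def rows_to_csv(rows):
    """Serialize scraped rows as CSV text for use in a spreadsheet."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(scrape_table(SAMPLE_HTML)))
```

This sketch treats every row on the page as part of one table, which is enough for simple pages. For pages with several tables or nested markup, a dedicated scraping library is usually a better fit.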

