The U.S. Government website data.gov currently lists 281,445 publicly available datasets (including those of healthdata.gov). The EU Open Data Portal lists 12,346. Ancestry LLC (ancestry.com) claims about 16 billion historical records. Global academic and research institutions maintain a vast network of databases and search engines for every academic specialty imaginable. Data is stored as spreadsheets, text files, scans, and in every imaginable format for advanced AI and machine learning tools. Some of these datasets originate in the 1800s. They are in all languages and measure the social, environmental, physical and economic systems of almost every country on the globe. And beyond those publicly available, millions more are held by institutions and corporations around the world.
Databases are everywhere. There are easily millions of them. There might be billions.
The data they contain spans everything from the human genome to trade manifests and the weather, and the sheer volume of information is growing rapidly. They certainly contain specific information that others know about you, your interactions with the government, and historical information about your family. They also contain anonymized and summarized information about your economic and social situation as compared to others in the population.
What makes databases unique as data sources is that they bring with them a much longer timeline and deeper history, but with far less detail and granularity if we go back even a decade or two. In addition, a database describes only what “was” at the point in time its data was collected: the information it contains may or may not be relevant at the point of a later query.