So far, you could only create a dataset by uploading a CSV file. But let's face it: most businesses keep huge amounts of data in various databases, so why bother with CSV files at all? Whatever the request, SQL is the standard way to sort that data and extract the parts that matter most. Lucky for you, with Graphite you can connect directly to your database and write your own SQL. Let's figure out how to do it.
As soon as you log in to Graphite, go to Datasets and click Create New. You can choose a connection to a MySQL/MariaDB or PostgreSQL database. Other connections, such as MS SQL, Amazon Redshift, etc., are still in development, but there is a little hack: if your only data source is Redshift, just create a PostgreSQL connection with your Redshift parameters and the connection should work, since Redshift speaks the PostgreSQL protocol.
After selecting a connection, give your dataset a name. Optionally, you can also write a description or select/create a tag.
Now we come to the most important part: establishing the connection. Enter your server hostname or IP address, database port, database user, database password, and database name, then click the Check Connection button. To allow Graphite to reach your database, please make sure your firewall accepts incoming requests from the following two IP addresses: 188.8.131.52 and 184.108.40.206.
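If Check Connection fails, a first thing worth verifying is whether the database host and port are reachable from outside your network at all. Here is a minimal sketch of that check in Python; the local listener below just stands in for your database server, and the function name is our own invention, not part of Graphite:

```python
import socket

def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Demo: stand up a throwaway local listener to play the database server.
server = socket.socket()
server.bind(("127.0.0.1", 0))   # port 0 = let the OS pick a free port
server.listen(1)
host, port = server.getsockname()

print(can_reach(host, port))    # the listener is up, so this prints True
server.close()
```

If this kind of check fails for your real host and port, the problem is networking (firewall, security group, wrong port), not your credentials.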
After your connection is established, it's time to show us your SQL knowledge: write the desired query and click the Run SQL button to get your data.
Scroll down to see all the columns from the selected dataset. If necessary, you can change column names, data types, or data formats; then click the Create button to create your dataset. Getting data from databases using SQL is much easier: you adjust the dataset to your needs! By repeating the steps above, you can easily get your data and start running various models without writing a single line of code. 🙂
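To give a feel for the kind of query you might paste into the Run SQL editor, here is a small sketch. An in-memory SQLite database stands in for your real MySQL/PostgreSQL server, and the `orders` table with its columns is made up purely for illustration:

```python
import sqlite3

# In-memory SQLite stands in for your real database;
# the `orders` table and its columns are invented for this example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL, placed_on TEXT);
    INSERT INTO orders VALUES
        (1, 'Acme',  120.0, '2021-01-10'),
        (2, 'Acme',   90.0, '2021-02-03'),
        (3, 'Globex', 200.0, '2021-02-15');
""")

# Aggregate raw rows into exactly the dataset your model needs.
query = """
    SELECT customer,
           COUNT(*)    AS num_orders,
           SUM(amount) AS total_spent
    FROM orders
    GROUP BY customer
    ORDER BY total_spent DESC
"""
for row in conn.execute(query):
    print(row)
# ('Acme', 2, 210.0)
# ('Globex', 1, 200.0)
```

This is the advantage over uploading a raw CSV: the aggregation happens in the database, and the dataset Graphite receives already has the shape you want.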
Free and high-quality datasets
For each dataset, it is necessary to determine its quality. Several characteristics describe high-quality data, but accuracy, reliability, and completeness stand out. High-quality data should be precise and error-free; otherwise it is misleading and inefficient. If your data is incomplete, it is harder to put to use because of the missing information. And what if your data is ambiguous or vague? Then you simply cannot trust it: it's unreliable.
Data quality is the answer to the question “How good is my data?” If your data supports your business operations and decisions, you can say it is of good quality.
By googling things like "free datasets for machine learning", "time-series dataset", "classification dataset", etc., you will see a bunch of links to different sources. But which of them contain high-quality data? We will list a few sources, but keep in mind that some of them also include data with drawbacks. That is why you need to be familiar with the characteristics of a good dataset.
Kaggle

Kaggle is a big data-science competition platform for predictive modeling and analytics. There are plenty of datasets you can use to learn artificial intelligence and machine learning. Most of the data is real and referenced, so you can test or improve your skills or even work on projects that could help people.
Each dataset has a usability score and a description. Within a dataset, there are various tabs such as Tasks, Code, and Discussions. Most datasets are related to different projects, so you can find models that were trained and tested on the same data. On Kaggle, you can also find a big community of data analysts, data scientists, and machine learning engineers who can evaluate your work and give you useful tips for further development.
UCI Machine Learning Repository
The UCI Machine Learning Repository is a collection of high-quality, real-world datasets for machine learning algorithms. The datasets are well known for their interesting properties and expected good results, and they can serve as useful baselines for comparison. On the other hand, they are small and already pre-processed.
GitHub

GitHub is one of the world’s largest communities of developers. Its main purpose is to be a code repository service, but within many projects you can find the datasets they are applied to. You will need to spend a little more time to find the dataset you want, but it will be worth it.
data.world

data.world is a large data community where people discover data and share analyses. Almost every project includes some available datasets. When searching, you will need to be very precise to get the desired results.
Of course, there are many more sources, depending on your needs. For example, if you need economic and financial datasets, you can visit World Bank Open Data, Yahoo Finance, EU Open Data Portal, etc. Once you have found your dataset, it’s Graphite time: run several models and create various reports using visualizations and tables. With Graphite, it's easier to make business decisions. Maybe you are just a few clicks away from the turning point of your career. 🙂
How to re-upload CSV files
Have you collected more data related to your uploaded CSV, or has there been a change in the data you uploaded? Don't worry, we thought about that and added a re-uploading option.
For example, if the new data sits in a CSV file together with the old data (the data you already uploaded), you can re-upload it as a fresh new dataset. Do the same if the uploaded data itself has changed. On the other hand, if the new data you want to add is in a separate CSV file (not together with the uploaded data), you can append it to the existing dataset. There is one small catch, though: the file you select must have the same column structure as the previously uploaded file!
To re-upload your data:
Go to Datasets list
Select the dataset you want to re-upload
Depending on your needs, you can select Append data
Select or drop your CSV file
For example, this is useful for monthly data. Imagine receiving a CSV file with certain data every month and needing to merge all the months into one file. Instead of repeating the copy-and-paste routine, a few clicks add the new data to the existing dataset, and so on every month. Ta-da, your new dataset is ready! 🙂
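Under the hood, the append flow boils down to two steps: check that the new file has the same column structure, then concatenate its rows onto the existing data. Here is a rough sketch of that idea in Python, using the standard library only; the column names and values are invented for the example:

```python
import csv
import io

# Two monthly CSV files, inlined as strings for the example.
january = "month,revenue\nJan,1000\n"
february = "month,revenue\nFeb,1200\n"

def append_csv(base_text: str, new_text: str) -> str:
    """Append new_text's data rows to base_text, enforcing identical headers."""
    base_rows = list(csv.reader(io.StringIO(base_text)))
    new_rows = list(csv.reader(io.StringIO(new_text)))
    if base_rows[0] != new_rows[0]:
        raise ValueError("column structure does not match the uploaded file")
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(base_rows + new_rows[1:])
    return out.getvalue()

print(append_csv(january, february))
# month,revenue
# Jan,1000
# Feb,1200
```

The header check is exactly the "small catch" mentioned above: a file whose columns differ is rejected instead of silently producing a misaligned dataset.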
How to upload CSV files
First things first, you have to upload your CSV file(s) into Graphite. But don’t worry, you are only a few steps away from the beginning of your Graphite journey.
Follow these simple steps:
Select CSV file
If you want, you can name the dataset and write a short description of the data. You can also select or create a tag, which you can later attach to your model and notebook for better organization among your files.
Select or drop your CSV file
Choose your parsing options
Select Parse (you can rename or change the data type of your columns)
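The parsing options above are essentially what any CSV reader needs to be told: which delimiter is used, whether there is a header row, and what type each column should get. As a rough illustration of why they matter, Python's `csv.Sniffer` can guess the delimiter of an unfamiliar file; the sample data below is invented:

```python
import csv
import io

# A semicolon-delimited file, a common variant of "CSV" in practice.
sample = "name;age;city\nAna;31;Zagreb\nIvan;28;Split\n"

# Guess the delimiter instead of assuming a comma.
dialect = csv.Sniffer().sniff(sample)
print(dialect.delimiter)    # ';'

reader = csv.DictReader(io.StringIO(sample), dialect=dialect)
# Cast the age column from string to int, like changing a column's
# data type in the Parse step.
rows = [{**row, "age": int(row["age"])} for row in reader]
print(rows[0])              # {'name': 'Ana', 'age': 31, 'city': 'Zagreb'}
```

Getting these options wrong (for instance, parsing the file above with a comma delimiter) would leave each row as a single unsplit column, which is why Graphite lets you preview and adjust them before creating the dataset.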
Later, you will also be able to connect to a database and extract data from it directly. Until then, prepare your CSV file, create your first dataset in Graphite, and start modeling. Enjoy! 🙂