Any data science expert will tell you that it’s better to have too much data than too little. For deep learning this is especially true: the more examples you have, the more accurately the connections between neurons capture the patterns on which the system bases its decisions.
There are also methods that help estimate the minimum dataset size needed for the task you’re trying to solve. For example, the historical rule of thumb for classification with deep learning is about 1,000 samples per class. From my own experience, I can tell you that this number can shrink if you use pre-trained models suited to your classification task. In my case, using a pre-trained model for facial recognition or facial identification allowed me to use only 10 images of a person to identify them successfully.
Of course, you shouldn’t forget about the quality of the data. For example, an imbalanced dataset will negatively impact the results of a binary classification, because one class dominates the dataset by number of samples. The problem can be mitigated by evaluating precision and recall rather than accuracy, and by applying imbalance correction techniques such as resampling or class weighting. However, according to this research, enlarging the dataset is a much better solution to this problem.
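To make the point about imbalance concrete, here is a minimal sketch in plain Python. The 95/5 class split, the always-predict-majority "model", and the helper names (confusion_counts, oversample_minority) are assumptions made up for illustration, and random oversampling is only one of several possible correction techniques.

```python
import random


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification task."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def precision_recall(y_true, y_pred):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    # Convention used here: a metric with an empty denominator is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def oversample_minority(samples, labels, seed=0):
    """Randomly duplicate samples until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_samples.extend(items + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels


# Hypothetical imbalanced dataset: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
# A degenerate model that always predicts the majority class.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
precision, recall = precision_recall(y_true, y_pred)
print(f"accuracy={acc:.2f} precision={precision:.2f} recall={recall:.2f}")

balanced_samples, balanced_labels = oversample_minority(list(range(100)), y_true)
```

The degenerate model scores high on accuracy while never finding a single positive sample, which is exactly what precision and recall expose.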
Before getting into details, let’s understand the process of gathering data in machine learning.
Understanding the Data Collection Process
Information that is formatted in a particular way is called data. Hence, data collection is the process of gathering, measuring, and analyzing accurate data from a variety of relevant sources to find answers to research problems, answer questions, evaluate outcomes, and forecast trends and probabilities. Accurate data collection is necessary to make informed business decisions, support quality assurance, and maintain research integrity. During data collection, the researchers must identify the data types, the sources of data, and the methods to be used.
1. Defining the Problem Statement
Clearly outline the objectives of the data collection process and the specific research questions you want to answer. This step will guide the entire process and ensure you collect the right data to meet your goals.
It is also recommended to identify your data sources at this point. Determine the sources from which you will collect data. These sources may include primary data (collected directly for your study) or secondary data (previously collected by others). Common data sources include surveys, interviews, existing databases, observation, experiments, and online platforms.
2. Planning Data Collection
At this stage, start by selecting the data collection methods. Choose the appropriate methods to collect data from the identified sources. The methods may vary depending on the nature of the data and the research objectives.
Common methods include:
- Surveys: Structured questionnaires administered to a target group to gather specific information.
- Interviews: Conducting one-on-one or group conversations to gain in-depth insights.
- Observation: Systematically observing and recording behaviors or events.
- Experiments: Controlling variables to study cause-and-effect relationships.
- Web scraping: Extracting data from websites and online sources.
- Sensor data collection: Gathering data from sensors or IoT devices.
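As an illustration of the web scraping method from the list above, the following sketch extracts table rows from an HTML snippet using only Python's standard library. The CellExtractor class and the SAMPLE_HTML content are hypothetical; a real scraper would first download the page and would need to respect the site's terms of use.

```python
from html.parser import HTMLParser


class CellExtractor(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None        # cells of the row currently being parsed
        self._in_cell = False   # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            # Rows without <td> cells (for example header rows) are skipped.
            if self._row:
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell and self._row:
            self._row[-1] += data.strip()


# Hypothetical page fragment standing in for a downloaded document.
SAMPLE_HTML = """
<table>
  <tr><th>city</th><th>temp_c</th></tr>
  <tr><td>Oslo</td><td>4</td></tr>
  <tr><td>Cairo</td><td>29</td></tr>
</table>
"""

parser = CellExtractor()
parser.feed(SAMPLE_HTML)
scraped_rows = parser.rows
print(scraped_rows)
```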
The next logical step is to design the data collection instruments. If using surveys or interviews, create well-structured questionnaires or interview guides. Ensure that the questions are clear, unbiased, and aligned with the research objectives.
Once these steps are complete, you can pilot the data collection process. Before the main data collection, conduct a pilot study to test the data collection instruments and methods. This helps identify any issues and allows for necessary adjustments to improve the data quality.
Only after that should you collect the data. Implement the data collection process according to the chosen methods and instruments. Ensure the data is recorded accurately and completely.
3. Ensuring Data Quality
The next step is crucial. Ensuring data quality means reviewing the collected data to check for errors, inconsistencies, or missing values. Apply quality assurance techniques to ensure the data is reliable and suitable for analysis.
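A quality review of this kind can be partly automated. The sketch below assumes records stored as dictionaries; the field names, the valid age range, and the quality_report helper are hypothetical and would need to be adapted to your own dataset.

```python
def quality_report(records, required_fields, valid_ranges):
    """Return a list of (record_index, field, problem) tuples."""
    issues = []
    for index, record in enumerate(records):
        # Missing values: the field is absent, None, or an empty string.
        for field in required_fields:
            if record.get(field) in (None, ""):
                issues.append((index, field, "missing"))
        # Inconsistencies: a numeric value outside its plausible range.
        for field, (low, high) in valid_ranges.items():
            value = record.get(field)
            if value in (None, ""):
                continue  # already reported as missing, if required
            if not isinstance(value, (int, float)):
                issues.append((index, field, "wrong_type"))
            elif not low <= value <= high:
                issues.append((index, field, "out_of_range"))
    return issues


# Hypothetical survey responses with two missing values and one implausible age.
responses = [
    {"age": 34, "country": "DE"},
    {"age": None, "country": "FR"},
    {"age": 210, "country": ""},
]

issues = quality_report(
    responses,
    required_fields=["age", "country"],
    valid_ranges={"age": (0, 120)},
)
for issue in issues:
    print(issue)
```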
The following step is data storage and management: organizing and storing the collected data in a secure and accessible manner. Consider using databases or other data management systems for efficient storage and retrieval.
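As one possible way to store collected records for later retrieval, here is a small sketch using SQLite through Python's built-in sqlite3 module. The responses table, its columns, and the helper functions are assumptions for illustration; an in-memory database is used so the example leaves no files behind.

```python
import sqlite3


def store_records(conn, records):
    """Create the (hypothetical) responses table if needed and insert records."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS responses ("
        "respondent_id INTEGER PRIMARY KEY, age INTEGER, country TEXT)"
    )
    # Named placeholders keep values out of the SQL string itself.
    conn.executemany(
        "INSERT OR REPLACE INTO responses (respondent_id, age, country) "
        "VALUES (:respondent_id, :age, :country)",
        records,
    )
    conn.commit()


def fetch_by_country(conn, country):
    """Return (respondent_id, age) rows for one country, ordered by id."""
    cursor = conn.execute(
        "SELECT respondent_id, age FROM responses "
        "WHERE country = ? ORDER BY respondent_id",
        (country,),
    )
    return cursor.fetchall()


# Use a file path instead of ":memory:" to persist the data on disk.
conn = sqlite3.connect(":memory:")
store_records(
    conn,
    [
        {"respondent_id": 1, "age": 34, "country": "DE"},
        {"respondent_id": 2, "age": 27, "country": "FR"},
        {"respondent_id": 3, "age": 51, "country": "DE"},
    ],
)
german_rows = fetch_by_country(conn, "DE")
print(german_rows)
```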
After that, you may need to clean and preprocess the data to prepare it for analysis. This involves handling missing values, removing duplicates, and transforming data into a suitable format.
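The three cleaning operations named above can be sketched in a few lines of plain Python. The record layout, the choice of median imputation for missing ages, and the clean_records helper are assumptions for illustration; the right strategy for missing values depends on your data.

```python
from statistics import median


def clean_records(records):
    """Normalize formats, drop duplicates, then fill in missing ages."""
    # 1. Transform into a consistent format: trimmed, upper-case country codes.
    normalized = [
        {"age": r["age"], "country": (r["country"] or "").strip().upper()}
        for r in records
    ]

    # 2. Remove exact duplicates while preserving the original order.
    seen = set()
    unique = []
    for record in normalized:
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)

    # 3. Handle missing values: replace a missing age with the median known age.
    known_ages = [r["age"] for r in unique if r["age"] is not None]
    fill_value = median(known_ages) if known_ages else None
    return [
        {**r, "age": fill_value if r["age"] is None else r["age"]}
        for r in unique
    ]


# Hypothetical raw records: one duplicate (after normalization), one missing age.
raw = [
    {"age": 34, "country": "de "},
    {"age": 34, "country": "DE"},
    {"age": None, "country": "fr"},
    {"age": 50, "country": "FR"},
]

cleaned = clean_records(raw)
for record in cleaned:
    print(record)
```

Normalizing before deduplicating matters here: the first two records only become identical once the country code has been trimmed and upper-cased.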
