Researchers have devoted considerable effort to developing techniques for predicting the number of defects in software at the code level. In addition, several studies have proposed approaches for improving the outcome of software defect prediction with a focus on binary classification. These studies have concentrated on improving the performance of learning algorithms. For instance, while investigating methods of improving classification accuracy, Galar et al. and He reported that the performance of learning algorithms can be improved by considering combinations of individual metrics when confronting challenges related to imbalanced data. Lessmann et al. evaluated whether one classifier is superior to another. Taking a different approach, Taba et al. investigated how to increase the performance of a bug prediction model by means of metric-based antipatterns. Batista et al. performed an extensive experimental assessment of learning algorithms applied to imbalanced data and reported that class imbalance does not completely prevent their successful application. Notably, such learning algorithms are capable only of classifying a module as either defective or defect-free. Petrić reported that current prediction models handle defect prediction in a black-box manner. This is a weakness of the existing models: they do not predict the number of defects but instead focus on classification, which merely forecasts whether a software module will be defective.
To enable the estimation of the number of defects in a new software version, our previous work proposed prediction models constructed from the code design complexity, defect density, defect introduction time and defect velocity. The results obtained suggest that the number of defects in a new release can be predicted at the class level of the software. Beyond that work, none of the existing studies has considered predicting the number of software defects at both the class and method levels using these variables. Hence, the current study extends this approach based on derived variables to forecast the number of defects expected to be present in a future software version at the method level. One drawback of the previous study is that the proposed approach was not applied at the method level; in addition, that study did not report detailed information on the regression models, such as percentage errors, which raises concerns about the quality of the data applied in that study. If the size of a software program can be predicted, as reported by Laranjeira and Dolado, then a method of predicting the number of defects likely to be present in a future release at the method level is also needed, and the current study attempts to provide such an approach. To fill this gap in the literature on method-level defect prediction, the current study presents a detailed analysis of the approach applied and clear evidence of reliable datasets. Above all, the accuracy of these prediction models depends on the quality of the datasets used, which determines whether the learning algorithms applied to these data can learn accurately. Sufficient dataset quality will also enable these algorithms to gain and transfer knowledge within and across projects.
Notably, the quality of the information gained from a reliable data source can enhance the intertask learning mechanism, as reported in previous studies. Furthermore, prior work has confirmed that datasets are imbalanced and inaccurate by nature and consequently must be freed from bias before any learning algorithm is applied. By the same token, training a model with a reliable dataset can increase the model's efficiency and optimize its output. There is therefore a need for an accurate means of acquiring reliable datasets suitable for training prediction models to achieve high performance. To ensure that the datasets applied in prediction studies are of high quality, this manuscript presents a step-by-step data preprocessing approach that frees these datasets from bias before learning algorithms are applied, thereby avoiding misleading results.
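The kind of step-by-step preprocessing described above can be sketched as follows. This is an illustrative pipeline under assumed steps (deduplication, removal of incomplete records, and class rebalancing by random oversampling), not the manuscript's actual procedure; the record fields are hypothetical.

```python
# Illustrative preprocessing sketch: deduplicate, drop incomplete records,
# then rebalance classes by oversampling the minority class.
import random

def preprocess(records, label_key="defective", seed=0):
    # Step 1: remove exact duplicate records, which can bias learning.
    seen, deduped = set(), []
    for r in records:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            deduped.append(r)
    # Step 2: drop records with missing (None) values.
    complete = [r for r in deduped if all(v is not None for v in r.values())]
    # Step 3: rebalance by randomly oversampling the minority class.
    pos = [r for r in complete if r[label_key]]
    neg = [r for r in complete if not r[label_key]]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        return complete
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return complete + extra

# Hypothetical module records: a duplicate, a missing value, and imbalance.
raw = [
    {"loc": 10, "defective": True},
    {"loc": 10, "defective": True},   # exact duplicate
    {"loc": 20, "defective": False},
    {"loc": 30, "defective": False},
    {"loc": 40, "defective": False},
    {"loc": None, "defective": False},  # incomplete record
]
clean = preprocess(raw)
```

After these steps the duplicate and incomplete entries are gone and both classes are equally represented, so a learning algorithm trained on `clean` is not systematically biased toward the majority class.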