Abstract

Software cost estimation is a critical task in software projects development. It assists project managers and software engineers to plan and manage their resources. However, developing an accurate cost estimation model for a software project is a challenging process. The aim of such a process is to have a better future sight of the project progress and its phases. Another main objective is to have clear project details and specifications to assist stakeholders in managing the project in terms of human resources, assets, software, data and even in the feasibility study. Accurate estimation results with definitely helps the project manager to do better estimation for the project cost, the time required for various project phases and resources or assets. This paper builds a software cost estimation model using machine learning approach. Different machine learning algorithms are applied to two public datasets to predict the software cost in the early stages. Results show that machine learning methods can be used to predict software cost with a high accuracy rate.

Keywords: Machine learning, Cost estimation, Prediction, Weka, Algorithms, Classification, Prediction models.

Received: 12 June 2019 / Revised: 15 July 2019 / Accepted: 20 August 2019/ Published: 27 September 2019

Contribution/ Originality

This study contributes to the existing literature by enhancing the results of thirteen Machine Learning algorithms on two datasets. The evaluation criteria used in this work are R², MAE, RMAE, RAE, and RRSE. The aim of the proposed model is to predict the effort using dataset attributes and compare them with the actual effort in order to measure the error using different criteria.

1. INTRODUCTION

Software cost estimation is a critical stage that is being done in the initial phases of software development process. The aim of such a process is to have a better future sight of the project progress and its phases. Another main objective is to have clear project details and specifications to assist stakeholders in managing the project in terms of human resources, assets, software, data and even in the feasibility study. Accurate estimation results with definitely helps the project manager to do better estimation for the project cost, the time required for various project phases and resources or assets. However, the inaccuracy may result from the project cost estimation process that will certainly affect the project delivery. A project with wrong or imprecision evaluation will face issues with delivery timing, resources required, budget or even in quality or operational side and sometimes the project may fail or aborted. Hence, the cost estimation is a significant part of the software projects and so it continues to be a complex issue in the software engineering field [1]. Therefore, many studies and researches have been conducted for the purpose of enhancing and improving the estimation process and get more accurate and dependable results.

On the other hand, recently machine learning (ML) techniques become very essential in software studies. In many scientific researches, ML methods are being used and executed most likely in the various fields, however, depending on the research nature and objectives one or more of the methods will be selected. As the process of software cost estimation is rapidly evolving which may include technology advances, team skills and experience, and tools and programming languages available, it gives superiority to ML techniques than some other methods that may stick to statistical and mathematical work [2]. Hence, ML can be a suitable technique to build the proposed model due to the ability to learn from historical data and adapt the wide variations that join software project development. In this work, ML techniques will be used to evaluate and compare the results of implementing such techniques on datasets. Dataset will be collected from the public that available on internet called Upsp05 and Upsp05-tf that contains practical software engineering data. Datasets can be downloaded from (http://tunedit.org/repo/promise/effortprediction) which is made publicly available to encourage and improve the cost estimation work in software engineering.

By applying ML methods on the dataset, it can be concluded if the ML techniques could be applied successfully on software cost estimation data or not. If yes, it would be possible to know which method scored the best results and also it is likely to decide if an ML model can be developed to evaluate and estimate the software cost.

2. RELATED WORK

2.1. Software Cost Estimation Review

Many studies proposed different models for estimating software cost. Several models have proposed and built to find alternatives, enhance or support existing models. Constructive cost model (COCOMO) is considered as one of the most known models in the field of software cost estimation. For the purpose of improving COCOMO II accuracy model [3] proposed a method to optimize the parameters used in COCOMO II model. The proposed model is capable of candling improper and unclear inputs in an efficient way and so improves software reliability. In 2011, a study [4] aimed to examine the results of applying fuzzy logic on COCOMO II and FL-COCOMO II and its effect on cost estimation. The research focused on SCE model and incorporating fuzzy logic to assess the inaccuracy of software attributes. The study’ outcomes showed from applying various datasets that FL-COCOMO II model scored better estimation outcomes than the COCOMO II based on different assessment criteria.

Litoriya, et al. [5] conducted an analysis of the cost drivers that affect directly the cost estimation model accuracy and a substitution process has done for those drivers with nearest values to show and prove the decrease in the software cost using Agile COCOMO II. In 2017, Saljoughinejad and Khatibi [6] proposed a new study based on COCOMO model to enhance and improve the accuracy of the software cost estimation process. The COCOMO model has been selected due to its flexibility and applicability to various types of projects. During the study, an analysis of cost drivers has been done using different meta-heuristic algorithms. The improvement process was based on the effective selection of the factors and coefficients used in the model. The improvement was on comprises cost drivers and coefficients estimations. Results showed that explicit superiority once a comparison is done between the proposed model and COCOMO or other models. Chen, et al. [7] focused on software cost estimation using different models such as COCOMO and how it can be improved by using the WRAPPER feature in DM. It is basically depending on feature subset selection (FSS), where it concentrates on mainly the most promising fields in the dataset and neglects the other to speed up the processing time. So by using the WRAPPER feature, they concluded that COCOMO model can be improved and its results could be more efficient. Khalifelu and Gharehchopogh [8] presented various software cost estimation models founded on data mining methods for the purpose of choosing appropriate Artificial intelligence (AI) techniques that are needed in new projects. Their main aim of all experiments was to asset and compare different data mining approaches with intermediate COCOMO models with respect to the prediction’s accuracy. The achieved results were promised and noticeable.

Beside many researches done on the COCOMO model and how to enhance it, there are other studies conducted trying to propose new models and algorithms that may add value to the field. Whigham, et al. [9] suggested a baseline model that is essential to be used for all projects that are being developed and software effort estimation study is required. It is very important to compare the results with the baseline when a new or existing method is carried out to examine the prediction process. Their proposed model called automatic transformations linear model (ATLM) that could be used as a baseline for the comparison process between the software effort estimation methods.

Sarro, et al. [10] introduced a new effort estimation algorithm that is based on a combination of confidence interval analysis and assessment of the mean absolute error (MAE). The developed algorithm has been tested and evaluated based on three different factors, however, the results were very promised. The experiments were done on the dataset that is collected from more than 700 software projects. Recently, Masoudi-Sobhanzadeh, et al. [11] proposed a novel software model for feature selection. It can be applied in many fields including designing drug, biology, and image processing. The model starts by selecting a subset of features/factors based on optimizations algorisms to be transmitted later to the classifiers or learners. The learners are SVM, ANN and Decision Tree, which can be applied to regressions and classification datasets. Two types of optimization algorithms and the three classifiers can be applied by researchers to any dataset to use this model which is called FeatureSelect. The FeatureSelect has been tested on 8 different datasets in size and nature were great results have been observed.

2.2. Machine Learning Review

On the other hand, machine learning algorithms considered to be vital in nowadays studies. ML techniques are widely used and their outcome results are dependable and reliable in many researches. In 2018, a deep study [12] did an analysis of 25 releases containing hundreds of classes to test the effort indicators. The study concluded that out of 18 machine learning algorithms used, IBk, KStar, Additive Regression, and Multi-Layer Perceptron were capable to estimate the test effort precisely. Moreover, the work in Khalid, et al. [13] proposed a prediction model to estimate the duration of the software processes using ML algorithms. Two training models which are Levenberg–Marquardt (LM) and Bayesian regularization back propagation (BR) used to test and evaluate FFNN and RBNN algorithms. The comparison between the two models showed that BR outcomes are slightly better. Also, BR is more desirable as its application is cost effective. Yeh and Deng [14] proposed a framework to predict the software product life cycle using two machine learning algorithms. The work presented a more precise and generalizable model for product cost estimation.

In 2019, research has been done on breast cancer [15] trying to build models for visualizing and detecting analytical signs of breast cancer survival rate using ML algorithms. To determine the important aspects of breast cancer survival rate, prediction algorithms were developed using the extreme boost, decision tree, neural networks, random forest support vector machine, and logistic regression. All the algorithms scored very high and close outcomes and the highest one was random forest which can be concluded that those methods could be used as predictive models in breast cancer studies. Some major challenges usually evolve software cost estimation process such as factors related to technology and creativity. In addition, some uncertain problems can happen during the implementation like lack of resources or increasing the cost of OS. Due to that, Kumari and Pushkar [1] introduced a hybrid algorithm to better estimate the SCE based on a combination of COA-Cuckoo and KNN. The hybrid algorithm runs on 6 different datasets and evaluated using 8 examined criteria. The overall results show improved accuracy on cost estimation.

Pospieszny, et al. [2] developed an approach based on machine learning algorithms to limit the gap between recent researches and real implementations. The achieved results for the proposed model were very accurate and authors believe that it gives more realistic results in terms of software effort and duration estimation for practical projects. Furthermore, Pandey [16] did an analysis study on most of the techniques and methods used in software cost estimation. He tried to mention some of the main advantages and drawbacks of each one. He concluded that all project factors are important and critical to evaluate and estimate the cost of a project and they can differ in their important and influence from one project to another. The main factors that should be included in estimation metrics are qualitative: team experience, development environment, and culture; quantitative factors are project size and the available resources. Başkeleş, et al. [17] carried out many experiments on software effort estimation using different machine learning methods based on three main dissimilar datasets. As a conclusion, they have noticed that parametric models are inadequate for software effort estimation process.

Some other studies focused more on the cost estimation environment and other related factors like software development cycle associated with each project. For example, in 2018, Rahikkala, et al. [18] introduced a study on the role of the organizational phenomena and how its various factors can improve and influence the process of software cost estimation. Most of the researches focus on the development and improvement of the SCE including technical factors and methodologies without mentioning or analyzing the impact of the environment or organizational factors. By conducting a case study and quantitative research, the authors concluded that senior management responsibility is essential in making a significant estimation, whereas the daily follow up is not required. They also found that no significant individual factors that may affect directly the estimation process. Another work [19] conducted on Agile Development Life Cycle. Due to its high success rates and because of its rapid nature capability to adopt changes, Agile Development Cycle became very popular and widely used since the late 90s. Vyas, et al. [19] did a great job in 2017 by doing a survey to highlight the most trends related to the Agile Software Development (ASD) process and how it relates to the cost estimation. The authors were able to identify the factors to be included or not in the estimation process in order to get a more accurate and true cost for projects.

3. METHODOLOGY

In this section, the work methodology is discussed through the illustration of some of the main ML algorithms and the assessment methodology, the structure of used datasets and its attribute description and finally the evaluation criteria.

3.1. Machine Learning Algorithms

In this section, a quick and brief overview is done on the ML methods that have been used in building prediction models.

Random forest is a machine learning algorithm constructed on decision tree algorithms. It operates like a group of constructed decision trees that works independently where each tree is using a distinctive part of the dataset. In each tree, it consists of two randomization levels. The first one is called “bagging” or bootstrap of aggregation and the other level is at each node of the decision tree [20]. REPTree is an abbreviation of Reduced Error Pruning Tree. The tree is being built in a fast and learnable way depending on the gained information. REPTree is another kind of decision trees that uses regression tree, which can create many trees in various rounds or iterations. Then, out of all the generated trees, the best one is being selected. To do the pruning process for the tree, the mean square error is measured based on the tree predictions [21]. M5P method is another tree model that is being constructed based on Quinlan’s M5 algorithm. Originally, in addition to adding the linear regression method to the tree leave nodes, M5 model is also based on the conventional decision tree. The trained data is used by the algorithm to form the nodes and represent the decision tree model [22]. The ZeroR algorithm is one of the simplest classifiers. It considers all required potential values and the attributes being targeted. Based on the provided data and using the target attribute, the required output will always be found. This classifier does not have a rule that depends on the untargeted attribute [23].

The Decision table classifier is a classification model used in prediction studies. Its idea is similar to decision trees or neural networks, where it involves a hierarchal table in which each row is a top level being broken down to construct another new table. Its structure is very close to dimensional stacks [24]. Input Mapped Classifier works like a wrapper that specifies the mismatches between test and training data by trying to build a relation between data used in training which the classifier has been constructed based on and the received test cases or instances [25]. Additive Regression classifier improves the performance of classifiers that are based on regression. Each iteration done uses the residuals generated from previous iterations. It can overcome overfitting problem but it takes extra time [26]. IBK stands for Instance-Bases Learning with parameter K. Its will-known name is K-nearest Neighborhood (KNN), while it is used in Weka software as IBK. It determines the number of nearest neighbors to be used in the instance classification process [27]. In 2009, K-Star classifier was first introduced by Hussain Aljazzar. K means the number of shortest paths that can be found between a selected group of data points in a given graph [28].

Gussian Processes classifier can be considered as an example of non-parametric algorithms. It is used to deduce a distribution of random variables collection over functions. It attempts to find the similar points within the distribution to forecast the value [29]. Linear regression is a ML algorithm that is classified under supervised learning. It predicts values based on the independently provided attributes. It is widely used in finding relationships between datasets attributes and prediction studies. So this classifier tries to find a linear relationship between the input values and the one to be predicted [30]. Multi-Layer Perceptron is a neural network algorithm that consists of three main layers which are input layer, at least one hidden layer, then an output layer. Depending on the dataset and the problem, one or more nodes can compose the output layer.

Table-1. Datasets attributes description.

Item no.	Attribute	Description	Type
1	ID	Object ID	Positive integer
2	Effort	The actual total hours expended on the implementing process	Positive integer
3	IntComplx	The level of the internal calculation’s complexity	1 to 5 that mean low to very high
4	DataEn	Total number of data-entry items	Positive number
5	DataFile	Total number of data-files accessed	Positive number
6	DataOut	Total number of data-output items	Positive number
7	UFP	Unadjusted function point count	Positive number
8	Lang	Language used	Language name
9	Tools	Used platforms and development tools	Tools/platform name
10	ToolExpr	Language and tool experience level	Range of number of months of experience
11	AppExpr	Applications experience level	1 to 5 which means low to very high
12	TeamSize	Size of the developing team	Range of min. to max. number
13	DBMS	Used DBMS	Database system name
14	Method	The used implementation methodology	e.g. OO, JAD
15	AppType	The used architecture	e.g. C/S, Centered
16	ObjType	Type of the object	PJ-project, FT- feature, RQ- requirement
17	Funct%	Percentage of functionality of features or requirements	1 to 7

Source: Dataset available at: http://tunedit.org/repo/PROMISE/EffortPrediction/usp05-ft.arff.

The input signals spread forward within the network while error signals spread backward. To reduce the error, some weight adjustments are being made [31]. SMOreg means Sequential Minimal Optimization that is an enhanced model of the SMO method which is based on support vector machine (SVM) used for regression. For un-linear prediction, SMOreg can be used in an efficient way. In SMOreg algorithm, some efficiency problems can be generated as a result of having one threshold [31].

3.2. Datasets and Evaluation Criteria

The datasets used in this work are prepared for software engineering experiments that are available publically. Two datasets will be used to test and compare the ML techniques, the first one is called usp05-ft and the second is usp05. The first one (Usp05-ft) contains 76 instances and consists of the first 15 attributes as described in Table 1.

Mean Absolute Error (MAE)

It is the absolute sum of the error divided by a number of predictions. The error is the difference between the actual effort and predicted effort, which can be calculated as follows:

Relative Absolute Error (RAE)

It is the sum of the absolute error divided by sum of absolute relative error. The error is the difference between the actual effort and predicted effort, while the relative error is the difference between the actual effort and its mean. RAE is computed using the following formula:

Root Relative Squared Error (RRSE)

It is the square root of the square error divided by the square relative error. The error is the difference between the actual effort and predicted effort, while the relative error is the difference between the actual effort and its mean. It is calculated by:

In addition to the 15 attributes, the second dataset (Usp05) contains 203 instances and consists of two more attributes which are 16 and 17 as described in Table 1. To evaluate the ML algorithms performance after applying them on the used datasets, five basic statistical indices were used as performance and assessment criteria. The indices are Mean Absolute Error, Root Mean Squared Error, Relative Absolute Error, and Root Relative.

Correlation Coefficient(R²)

The correlation coefficient shows how the actual effort and predicted effort are related. It gives a value between -1 and 1. The correlation is 1 when the values increase together, and it goes down to -1 when there is no relation between the values. R² is computed as the following formula:

Squared Error, and Correlation Coefficient. Basically, they are measuring the error rate between the actual effort within the dataset and the predicted effort using the ML algorithm.

4. EXPERIMENTAL RESULTS

Weka tool version 3.8 has been used to evaluate the used algorithms. The 10 folds' cross-validation technique is used to train the data on 90% and test it on the remaining 10% until the whole data (100%) is being used as a test data through 10 cycles. Six types of classifiers have been used in Weka to test the 13 algorithms. The used classifiers or algorithms are Random Forest, REPTree, M5P, ZeroR, Decision Table, Input Mapped Classifier, Additive Regression, IBK, KStar, Gussian Processes, Linear Regression, Multilayer Perceptron, and SMOreg. The 13 methods have been tested on two datasets, the first one is (Usp05-ft) with 15 attributes and 76 instances. The second one (Usp05) is with 17 attributes and 203 instances.

Table 2 shows the results of applying the 13 algorithms on the dataset (Usp05-ft). The evaluation and comparison were done on 5 statistical error measures. The last row in Table 2 represents the average values for each measurement criteria.

Table-2. Performance results on usp05-ft dataset.

Method	R²	MAE	RMSE	RAE %	RRSE %
Random forest	0.8441	2.5025	4.8546	41.7323	55.6778
REP tree	0.7607	3.2553	5.6754	54.2868	65.0918
M5P	0.705	3.2359	6.2428	53.9628	71.5994
ZeroR	-0.2644	5.9965	8.7191	100	100
Decision table	0.7137	3.4225	6.241	57.0738	71.579
Input- mapped- classifier	-0.2644	5.9965	8.7191	100	100
Additive- regression	0.7136	3.2441	6.5709	54.0993	75.363
IBK	0.7853	2.5132	5.8021	41.9101	66.5451
KStar	0.7797	2.7272	6.0219	45.4795	69.0658
Gussian- processes	0.7604	2.8814	5.6809	48.0511	65.1554
Linear- regression	-0.2644	5.9965	8.7191	100	100
Multilayer- perceptron	0.7979	2.8173	5.6413	46.9824	64.7004
SMOreg	0.7504	2.6597	6.2764	44.3547	71.9853
Average	0.5244	3.6345	6.5511	60.6102	75.1356

Table-3. Best and worst results on usp05-ft dataset.

Criteria	R²	MAE	RMAE	RAE%	RRSE%
Best result	0.8441	2.5025	4.8546	41.7323	55.6778
Algorith m	Random forest	Random forest	Random forest	Random forest	Random forest
Worst Result	-0.2644	5.9965	8.7191	100	100
Algorith m	ZeroR, input mapped classifie r, and linear reg- ression	ZeroR, input mapped classifie r, and linear reg- ression	ZeroR, input mapped classifie r, and linear reg- ression	ZeroR, input mapped classifie r, and linear reg- ression	ZeroR, input mapped classifie r, and linear reg- ression
Average	0.28985	4.2495	6.78685	70.8662	77.8389

Figure 1 is representing the distribution of the values for the best results scored by Random Forest algorithm. The points in both figures show the differences between the actual effort points and the predicted ones using the random forest model. Figure 2 representing the values for the worst results scored by model ZeroR where the predicted results are being represented as a flat line.

Figure-1. Predicted and actual values distribution using random forest.

Figure-2. Predicted and actual values distribution using ZeroR.

Table 3 shows the best and worst results scored among the 13 algorithms that have have been applied to the first dataset (Usp05-ft). In all types of statistical measurements used to test and evaluate the error rate, Random Forest scored the best results in all the indices with the lowest error rate for the four types (MAE, RMAE, RAE, RRSE) and highest correlation.

Table 4 represents the performance measurement results after applying the 13 algorithms on the second dataset (Usp05). Two more attributes have been added to the dataset and the number of instances is almost three times than the first dataset. The last row is showing the average value for each measurement criteria.

Table-4. Performance results on usp05 dataset.

Method	R²	MAE	RMSE	RAE %	RRSE %
Random forest	0.4319	8.1464	31.8046	60.5244	92.3641
REP tree	0.6296	10.0367	29.757	74.569	86.4177
M5P	0.4628	9.9475	30.6858	73.9061	89.1151
ZeroR	-0.3113	13.4596	34.4339	100	100
Decision table	0.5118	8.7527	30.1391	65.0297	87.5273
Input- mapped- classifier	-0.3113	13.4596	34.4339	100	100
Additive- regression	0.558	8.445	28.2736	62.7431	82.1098
IBK	0.2504	8.7266	38.9912	64.8355	113.2349
KStar	0.3626	7.3318	32.1284	54.4729	93.3046
Gussian- processes	0.4978	8.2471	29.659	61.2729	86.1333
Linear- regression	-0.3113	13.4596	34.4339	100	100
Multilayer- perceptron	-0.022	15.7926	36.513	117.3334	106.038
SMOreg	0.4962	8.8851	29.5932	66.0134	85.9422
Average	0.2496	10.3608	32.3728	76.9770	94.0144

Compared to the results of the first dataset, the experimental results are varying from the first one. In Table 4, it is noticeable the change of the error percentage as all error measures become higher except the correlation coefficient becomes lower. Table 5 represents the worst and best results scored by various applied algorithms. It can be observed that the best results are between 3 algorithms which are REPTree, KStar and Additive Regression, while the worst results have been scored either by IBk or Multilayer Perception. For the R², the worst results have been scored by ZeroR, Linear Regression, and Input Mapped Classifier. The last row in Table 5 is the average value between the best and worst scored results.

Table-5. Best and worst results on usp05 dataset.

Criteria	R²	MAE	RMAE	RAE %	RRSE %
Best result	0.6296	7.3318	28.2736	54.47	82.1098
Algorith m	REPTree	KStar	Additive Regressi on	KStar	Additive Regressi on
Worst Result	-0.3113	15.7926	38.9912	117.3334	113.23
Algorith m	ZeroR, linear regression, Input mapped classifier	Multilay er Perceptr on	IBk	Multilay er Perceptr on	IBk
Average	0.15915	11.5622	33.6324	85.9017	97.6699

Figure 3 and Figure 4 are representing the distribution of the values for the best error results scored by the models which are Additive Regression and KStar. The points in both figures show the differences between the actual effort points and the predicted ones using the Additive Regression and KStar algorithms. Figure 5 and Figure 6 representing the values for the two worst error results scored by IBk and Multilayer Perceptron models.

Figure-3. Predicted and actual values distribution using additive regression.

Figure-4. Predicted and actual values distribution using KStar.

Figure-5. Predicted and actual values distribution using IBk.

Figure-6. Predicted and actual values distribution using multilayer perceptron.

5. CONCLUSION AND FUTURE WORK

In this work, 13 ML algorithms have been evaluated using two datasets. The evaluation criteria used in this work are R², MAE, RMAE, RAE, and RRSE. The aim of the proposed model is to predict the effort using dataset attributes and compare them with the actual effort in order to measure the error using different criteria. The higher the value of R² the better result, for the rest of the measurement criteria, the lower value means a better result. Random Forest achieved the best results in the first experiment using (Usp05-ft) dataset and another three models which are REPTree, Additive Regression and Kstar scored the best results using (Usp05) dataset. ZeroR method scored the worst results using the first dataset while some other methods including Multilayer Perceptron, IBk, and Linear Regression didn't perform well on the second dataset.

In the future, more datasets can be included in the study to have a wider angle and more variety in the inputs which will be reflected to have a better estimation and more accurate results. Moreover, some other ML algorithms can be tested and involved in upcoming studies in order to cover all available machine learning methods.

Funding: This study received no specific financial support.

Competing Interests: The author declares that there are no conflicts of interests regarding the publication of this paper.

REFERENCES

[1] S. Kumari and S. Pushkar, "Cuckoo search based hybrid models for improving the accuracy of software effort estimation," Microsystem Technologies, vol. 24, pp. 4767-4774, 2018. Available at: https://doi.org/10.1007/s00542-018-3871-9.

[2] P. Pospieszny, B. Czarnacka-Chrobot, and A. Kobylinski, "An effective approach for software project effort and duration estimation with machine learning algorithms," The Journal of Systems & Software, vol. 137, pp. 184–196, 2018. Available at: https://doi.org/10.1016/j.jss.2017.11.066.

[3] K. Langsari, R. Sarno, and Sholiq, "Optimizing effort parameter of COCOMO II using particle swarm optimization method," Telkomnika, vol. 16, pp. 2208-2216, 2018. Available at: https://doi.org/10.12928/telkomnika.v16i5.9703.

[4] I. Attarzadeh and S. H. Ow, "Improving estimation accuracy of the COCOMO II using an adaptive fuzzy logic model," presented at the 2011 IEEE International Conference on Fuzzy Systems, Taipei, Taiwan, 2011.

[5] R. Litoriya, N. Sharma, and D. A. Kothari, "Incorporating cost driver substitution to improve the effort using Agile COCOMO II," presented at the 2012 CSI Sixth International Conference on Software Engineering, 2012.

[6] R. Saljoughinejad and V. Khatibi, "A new optimized hybrid model based On COCOMO to increase the accuracy of software cost estimation," Journal of Advances in Computer Engineering and Technology, vol. 4, pp. 27-40, 2018.

[7] Z. Chen, T. Menzies, D. Port, and B. Boehm, "Feature subset selection can improve software cost estimation accuracy," ACM SIGSOFT Software Engineering Notes, vol. 30, pp. 1-6, 2005. Available at: https://doi.org/10.1145/1082983.1083171.

[8] Z. A. Khalifelu and F. S. Gharehchopogh, "Comparison and evaluation of data mining techniques with algorithmic models in software cost estimation," Procedia Technology, vol. 1, pp. 65-71, 2012. Available at: https://doi.org/10.1016/j.protcy.2012.02.013.

[9] P. A. Whigham, C. A. Owen, and S. G. Macdonell, "A baseline model for software effort estimation," ACM Transactions on Software Engineering and Methodology, vol. 24, pp. 1-11, 2015. Available at: https://doi.org/10.1145/2738037.

[10] F. Sarro, A. Petrozziello, and M. Harman, "Multi-objective software effort estimation," presented at the ACM 38th IEEE International Conference on Software Engineering, 2016.

[11] Y. Masoudi-Sobhanzadeh, H. Motieghader, and A. Masoudi-Nejad, "Feature select: A software for feature selection based on machine learning approaches," BMC Bioinformatics, vol. 20, pp. 1-17, 2019. Available at: https://doi.org/10.1186/s12859-019-2754-0.

[12] V. Vig and A. Kaur, "Test effort estimation and prediction of traditional and rapid release models using machine learning algorithms," Journal of Intelligent & Fuzzy Systems, vol. 35, pp. 1657-1669, 2018. Available at: https://doi.org/10.3233/jifs-169703.

[13] A. Khalid, M. A. Latif, and M. Adnan, "An approach to estimate the duration of software project through machine learning techniques," Gomal University Journal of Research, vol. 33, pp. 1-13, 2017.

[14] T.-H. Yeh and S. Deng, "Application of machine learning methods to cost estimation of product life cycle," International Journal of Computer Integrated Manufacturing, vol. 25, pp. 340-352, 2012. Available at: https://doi.org/10.1080/0951192x.2011.645381.

[15] M. D. Ganggayah, N. A. Taib, Y. C. Har, P. Lio, and S. K. Dhillon, "Predicting factors for survival of breast cancer patients using machine learning techniques," BMC Medical Informatics and Decision Making, vol. 19, pp. 1-17, 2019. Available at: https://doi.org/10.1186/s12911-019-0801-4.

[16] P. Pandey, "Analysis of the techniques for software cost estimation," presented at the 2013 Third International Conference on Advanced Computing and Communication Technologies (ACCT), Rohtak, India, 2013.

[17] B. Başkeleş, B. Turhan, and A. Bener, "Software effort estimation using machine learning methods," presented at the 2007 22nd International Symposium on Computer & Information Sciences, 2007.

[18] J. Rahikkala, S. Hyrynsalmi, V. Leppänen, and I. Porres, "The role of organisational phenomena in software cost estimation: A case study of supporting and hindering factors," E-Informatica Software Engineering Journal, vol. 12, pp. 167–198, 2018.

[19] M. Vyas, A. Bohra, D. C. Lamba, and A. Vyas, "A review on software cost and effort estimation techniques for agile development process," International Journal of Recent Research Aspects, vol. 5, pp. 612-618, 2016.

[20] S. A. Woznicki, J. Baynes, S. Panlasigui, M. Mehaffey, and A. Neale, "Development of a spatially complete floodplain map of the conterminous United States using random forest," Science of the Total Environment, vol. 647, pp. 942-953, 2019. Available at: https://doi.org/10.1016/j.scitotenv.2018.07.353.

[21] S. Kalmegh, "Analysis of weka data mining algorithm reptree, simple cart and randomtree for classification of Indian news," International Journal of Innovative Science, Engineering & Technology, vol. 2, pp. 438-446, 2015.

[22] S.-A. Blaifi, S. Moulahoum, R. Benkercha, B. Taghezouit, and A. Saim, "M5P model tree based fast fuzzy maximum power point tracker," Solar Energy, vol. 163, pp. 405-424, 2018. Available at: https://doi.org/10.1016/j.solener.2018.01.071.

[23] T. Rajasekaran, P. Jayasheelan, and K. S. Preethaa, "Predictive analysis in agriculture to improve the crop productivity using zeroR algorithm," International Journal of Computer Science and Engineering Communications, vol. 4, pp. 1397-1401, 2016.

[24] B. G. Becker, "Visualizing decision table classifiers," in Proceedings IEEE Symposium on Information Visualization, 1998.

[25] Class Input Mapped Classifier, Available: http://weka.sourceforge.net, 2019.

[26] Additive Regression, Available: https://www.cs.waikato.ac.nz/ml/weka/, 2019.

[27] Gerardnico, "Machine learning - K-nearest neighbors (KNN) algorithm - instance based learning." Available: https://gerardnico.com/, 2017.

[28] University of Konstanz, "K* Algorithm (K Star)." Available: https://www.sen.uni-konstanz.de/, 2019.

[29] M. Krasser, "Gaussian processes." Available: http://krasserm.github.io, 2018.

[30] Geeksforgeeks, "ML linear regression." Available: https://www.geeksforgeeks.org, 2019.

[31] P. Singh and S. Agrawal, "Node localization in wireless sensor networks using the M5P tree and SMOreg algorithms," presented at the 2013 5th International Conference and Computational Intelligence and Communication Networks. IEEE, 2013.

Views and opinions expressed in this article are the views and opinions of the author(s), Review of Computer Engineering Research shall not be responsible or answerable for any loss, damage or liability etc. caused in relation to/arising out of the use of the content.

Index