
Prediction of Air Pollution Using Random Forest

¹Sahil Singh, ¹Ayush Yadav, ¹Akhilesh Kumar

¹Computer Science and Engineering, Galgotias University (established under Galgotias University Act 14 of 2011), Greater Noida, India

Abstract—The aim of this project is to use a heterogeneous ensemble of differential evolution with the random forest technique to predict pollution in the Indian capital. This approach differs from existing work such as mathematical models for predicting regional air quality. We will use the random forest algorithm, an ensemble method for regression and classification, to predict air pollution in the urban region of New Delhi; in this paper it is used as a classifier. We will determine the presence of air pollutants such as carbon monoxide, carbon dioxide, nitrogen, sulphur dioxide, ozone, PM2.5 and PM10. We will also use additional data, namely meteorology data, point of interest data, and traffic and road data, to determine the air quality of New Delhi. Determining the air pollution of any region requires real-time air quality information from monitoring stations for pollutants such as PM2.5[25], PM10 and NO2, as these are the main air pollutants that damage human health and the environment. Particulate matter in particular is a main factor affecting human health and city management, and it also affects government policy. However, big cities have very few air quality monitoring stations. In this paper we therefore discuss the random forest technique for predicting and measuring pollution; the algorithm is used for both training on the data and prediction.

Keywords—prediction of air quality; random forest; air quality index; traffic data; RAQ algorithm; methodology

I. INTRODUCTION

Modernization drives growth, yet transportation still depends on fossil fuels such as petrol, diesel, gas and CNG. Vehicle use is very high and keeps increasing air pollution, because a running vehicle releases harmful gases such as carbon dioxide, carbon monoxide and nitrogen. Urban air pollution is now a major problem in countries all over the world, because the pollutants have a very bad effect on human beings and other living things; air pollution can also result in acid rain and the greenhouse effect.

Diseases such as lung cancer are caused by these harmful pollutants. In India especially, pollution is now a massive problem in big cities such as Delhi and other urban centres, where the pollutants include exhaust emissions. Delhi has about 5 million vehicles, and coal is burned in neighbouring states. In November 2019, Delhi-NCR was like a death chamber because of pollution from non-electric vehicles and the bursting of firecrackers during Diwali. In Delhi-NCR the pollution problem peaks in winter, whereas in summer it is moderate. The government has also taken measures to control the problem by acting on motor vehicle manufacturers, telling them to build vehicles that cause less pollution, such as BS-4, BS-5 and BS-6 vehicles. The government is also planning to ban old vehicles, as they release more pollution.

1. Literature survey:

Many previous studies of air quality have used approaches such as monitoring air quality by satellite, monitoring air quality through wireless network systems, and dispersion models.

Many models have also been proposed for random forest classification, such as that of Breiman, L.[17]; to improve prediction accuracy, Breiman, L.[2] also introduced out-of-bag estimation for random forest classification. Mathematical models such as the Eulerian model[20] are air quality measurement models of different types. The main objective of these models is to trace the polluting material after it is released from its sources, but they do not consider many of the conditions that affect human health and the concentration of air pollutants. These models depend on accurate data, such as traffic emissions and wind speed, but the data cannot be accurate in all conditions. For example, wind speed differs from region to region because of weather conditions and the obstruction caused by buildings. With these models we can therefore only estimate the value from the fuel consumption and the distance travelled by a specific particle or vehicle.

For this reason, we use the random forest classification approach in this paper.

A. Problem Description and Definition

A.a. Air Quality Index (AQI)

An air quality index is a number used by government and private authorities to report the level of air pollution. As the index increases, more people are likely to suffer health problems such as breathing difficulties. Different countries use different air quality indices to measure their AQI level. In this paper we use the Indian standard, which is based on 24-hour values for most pollutants. In India, AQI measurement is mainly based on atmospheric gases, namely nitrogen, ozone, carbon monoxide and nitrogen dioxide, and on particulate matter below 10 micrometres such as PM2.5[25] and PM10. The AQI value is calculated per hour according to a formula published by India's central board for the prevention and control of pollution.
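As a rough illustration of how a pollutant concentration can be mapped to an index value, the sketch below assumes the piecewise-linear sub-index form commonly used by national AQI schemes, with the overall AQI taken as the largest pollutant sub-index. The breakpoint table `PM25_BREAKPOINTS` and the function names are hypothetical placeholders chosen for illustration; they are not taken from this paper, and the officially published breakpoints should be substituted before any real use.

```python
# Illustrative sketch only: the breakpoints are placeholder values,
# not the official table referred to in the text.
# Each row: (conc_low, conc_high, index_low, index_high)
PM25_BREAKPOINTS = [
    (0.0, 30.0, 0, 50),
    (30.0, 60.0, 50, 100),
    (60.0, 90.0, 100, 200),
    (90.0, 120.0, 200, 300),
    (120.0, 250.0, 300, 400),
]


def sub_index(conc, breakpoints):
    """Linearly interpolate a concentration onto the index scale."""
    if conc < 0:
        raise ValueError("concentration must be non-negative")
    for c_lo, c_hi, i_lo, i_hi in breakpoints:
        if c_lo <= conc <= c_hi:
            return i_lo + (i_hi - i_lo) * (conc - c_lo) / (c_hi - c_lo)
    # Above the last breakpoint: clamp to the top of the table (assumption).
    return float(breakpoints[-1][3])


def overall_aqi(sub_indices):
    """Assumed aggregation rule: the worst (largest) sub-index wins."""
    return max(sub_indices)
```

Taking the maximum means a single badly polluting component dominates the reported level, which is one common design choice; other aggregation rules are possible.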

1. Points of interest

A point of interest (POI) is a specific location that people find useful or interesting, for example a stadium, sports ground, shopping complex or hotel. If there is a shopping complex nearby, large crowds will gather there, and the area will see heavier use of polluting technology such as vehicles and generators, which contributes to the air quality of that particular area[28].

2. RAQ algorithm

In the RAQ algorithm, we first collect data from the sensing systems: point of interest data; traffic data, particularly from places where heavy traffic causes more air pollution; monitoring station data, such as the AQI level of a given place; and meteorology data such as the humidity, temperature and pressure of a specific place. From the collected data we then extract the features most relevant to air pollution; this step is termed feature extraction. Next, RAQ applies random forest classification. We draw bootstrap samples and build a decision tree from each sample. Sampling is done with replacement, which means that data used in an earlier decision tree may appear again in a new one. Once all the decision trees have been constructed, the trees vote on the final prediction, and the class that receives the most votes is taken as the final result.
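The bootstrap step described above can be sketched as follows. This is a minimal illustration using only Python's standard library; the record layout, the example values and the function name `bootstrap_sample` are assumptions made for the example, not part of RAQ as described in the text.

```python
import random


def bootstrap_sample(records, rng):
    """Draw len(records) records with replacement.

    Returns the sample and the indices that were never drawn,
    i.e. the records this particular tree would not see.
    """
    n = len(records)
    drawn = [rng.randrange(n) for _ in range(n)]
    sample = [records[i] for i in drawn]
    not_drawn = sorted(set(range(n)) - set(drawn))
    return sample, not_drawn


# Hypothetical feature records:
# (temperature, humidity, traffic_level, aqi_label)
records = [
    (31.0, 40.0, 2, "moderate"),
    (18.0, 85.0, 3, "poor"),
    (25.0, 60.0, 1, "good"),
    (12.0, 90.0, 3, "severe"),
]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [bootstrap_sample(records, rng) for _ in range(3)]  # one per tree
```

Because each draw is independent, a record can appear several times in one sample and not at all in another, which is what gives the individual trees different views of the data.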

Fig. 1. Flow chart of RAQ algorithm.

A. Meteorology Data

Meteorology data comprise the temperature[29], humidity[29] and pressure of a place. These are common factors affecting the spread of pollutants such as NO2, SO2, ozone and PM2.5[25] in the environment; for example, higher humidity at a place increases the chance that air pollutants spread. Meteorology is one of the most important factors in determining the air quality of a region because it directly governs how pollutants spread from their source into the environment; for example, a change in temperature at a specific location affects how pollutants are transported into the air. In this paper we use different types of meteorological data provided by the meteorological department of India.

B. Traffic and Road Data

In this paper we focus on two main characteristics of traffic: the length of the road and traffic slowdowns at that place. Road traffic data are used as follows. When a road is long, congestion is lower but gas emissions are higher because of the total number of vehicles on the road; when a road is short, emissions are lower because fewer vehicles are present, even if congestion is heavy. We have no method of observing directly how much is emitted on a road. Many map applications[30,31] provide online routes for most roads together with the traffic congestion on them, so that a user can see how much traffic a road carries and which road is suitable for a journey. With the help of these service providers we can learn how much traffic there is and estimate how much air pollutant emission to expect. These providers do not give open access to their data for specific uses, but we can still obtain useful information for determining air quality, for example when a website shows heavy traffic and pollution at certain places. Such data are usually collected directly from GPS units installed in vehicles, or can be determined through speed sensors.

C. POI Data

A point of interest is a specific place or region that people find useful or beneficial. The gathering of people at a specific place indicates the land use and traffic congestion there. For example, a big shopping complex leads to heavy use of vehicles, whose emissions cause air pollution and decrease the air quality of the region.

D. Random Forest Classification

Random forest is one of the most widely used algorithms owing to its simplicity and versatility. We use the random forest algorithm for multi-class classification.

Algorithm 1.

In: A data set S with different features: Anq, Anh, Anp, Anw, A, Ari, Aqcs, An and labeled Air Quality Index level; unlabeled data set R; number of trees Q; number of features N

Out: Air Quality Index level

A. Build Q trees;

B. Select n random features from S;

C. Use the n features at every node;

D. Compute the gain using the entropy:

$$\mathrm{Entropy}(c) = -\sum_{i=1}^{k} p(c_i)\,\log_2 p(c_i)$$

2. Evaluation

(iii) Evaluation Method

We will use the out-of-bag (OOB) error to compare the accuracy of RAQ across different parameter pairs, that is, the number of trees and the number of features used to construct the random forest. The OOB error is calculated internally by the random forest while the trees are being built.

E. Choose the feature with the maximum gain to split the data set at the node;

F. Remove the used feature from the feature candidates;

G. Input the unlabeled data into the trees;

Finally, obtain the predicted Air Quality Index level according to Equations (1) and (2).
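The entropy criterion of step D and the maximum-gain choice of step E can be sketched as below. Only the entropy formula appears in the text; the information-gain function here is the standard definition (parent entropy minus the size-weighted entropy of the child nodes) and is an assumption about what the second equation contains. All function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy(c) = -sum_i p(c_i) * log2 p(c_i) over the class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(labels).values())


def information_gain(labels, groups):
    """Parent entropy minus the size-weighted entropy of the child groups.

    `groups` is a list of label lists, one per branch of the split.
    """
    total = len(labels)
    weighted = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - weighted


def best_feature(rows, labels, candidates):
    """Return the candidate feature index whose split has the largest gain."""
    def gain_for(f):
        branches = {}
        for row, label in zip(rows, labels):
            branches.setdefault(row[f], []).append(label)
        return information_gain(labels, list(branches.values()))
    return max(candidates, key=gain_for)
```

For example, with rows `[("high", "x"), ("high", "y"), ("low", "x"), ("low", "y")]` and labels `["poor", "poor", "good", "good"]`, feature 0 separates the classes perfectly while feature 1 does not, so `best_feature(rows, labels, [0, 1])` selects feature 0.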

3. Methodology

How will the random forest algorithm work in this paper?

As already discussed, a random forest is an ensemble classifier that uses many trees to predict the final result for a specific problem.

a) First, we need a data set for the problem. In our case we first build a data set of the air pollutants, such as PM2.5, PM10, carbon dioxide, SO2, carbon monoxide, lead and nitrogen.

b) We then have a data set of air pollutants for which the air quality level has to be determined.

c) Next, we select random samples from this data set.

d) The random forest algorithm constructs a decision tree for every sample drawn.

e) It then obtains a prediction from every decision tree.

f) The outputs of all the decision trees are put to a vote.

g) Finally, the prediction that receives the most votes is selected as the final result.

The methodology above is shown in the diagram below.

Fig. 2. Flow diagram of the random forest methodology.
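Steps e) to g) amount to a majority vote over the per-tree predictions. The sketch below uses three hand-written rules as stand-ins for trained decision trees; the rules, thresholds and feature names are hypothetical and serve only to show the voting step, not a trained model.

```python
from collections import Counter


# Stand-ins for trained decision trees (hypothetical thresholds).
def tree_pm25(sample):
    return "poor" if sample["pm25"] > 90 else "good"


def tree_pm10(sample):
    return "poor" if sample["pm10"] > 250 else "good"


def tree_no2(sample):
    return "poor" if sample["no2"] > 180 else "good"


FOREST = [tree_pm25, tree_pm10, tree_no2]


def predict_by_vote(forest, sample):
    """Collect one prediction per tree and return the most common label."""
    votes = Counter(tree(sample) for tree in forest)
    label, _count = votes.most_common(1)[0]
    return label, dict(votes)


# Hypothetical reading: two of the three rules fire.
sample = {"pm25": 120.0, "pm10": 200.0, "no2": 190.0}
label, votes = predict_by_vote(FOREST, sample)
# label == "poor", votes == {"poor": 2, "good": 1}
```

Using an odd number of trees avoids ties in a two-class vote; with more classes a tie-breaking rule would have to be chosen explicitly.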

4. Results

We found no research in India related to our work, that is, on predicting air pollution using random forest, but there is much related work outside India, mainly in China. In China, the same technique that we describe was used to predict air quality in a remote sensing area, mainly in Shenyang.

We use the same formulas that they used, for the prediction of air quality in New Delhi. We will use different types of data for AQI measurement, such as POI data and traffic data. However, we have not collected any of the data required for our prediction, such as traffic data, road data, POI data, meteorology data or the presence of air pollutants. We have only proposed an idea, not an implementation of it, because we are students and it is not possible for us to collect all these data.

We have only given the idea and concept of random forest classification so that it can be used for air quality prediction in the future.

We have clearly described what one should do to predict air pollution in a given region.

We have given the algorithm and formulas, explained how they work, and indicated where the data for such research need to be collected.

Two main factors affect the performance of a random forest: the number of trees and the number of features.

Fig. 3. OOB error result distribution.

In the graph above, the number of trees is on the y-axis and the number of features is on the x-axis. Both take integer values, so the graph gives only coordinate values such as (2,300) and (4,700). Different colours represent different OOB error values, and the deeper the colour, the smaller the OOB error. The graph shows that the OOB error is small at the coordinates (4,400) and (7,1000); we take the smaller coordinate as our reference.
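To make the (features, trees) comparison concrete, the sketch below estimates an out-of-bag error for each parameter pair on a tiny synthetic data set and picks the pair with the smallest error, preferring the smaller coordinates on ties. It is a simplified illustration under several assumptions: one-split stumps stand in for full decision trees, the feature subset is drawn once per tree rather than at every node, and the data and function names are made up for the example. It does not reproduce the values in Fig. 3.

```python
import random
from collections import Counter


def train_stump(rows, labels, feature_ids):
    """Fit a one-split 'tree': best (feature, threshold) by training accuracy."""
    majority = Counter(labels).most_common(1)[0][0]
    best = (-1, None)  # (number of correct training rows, rule)
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [l for r, l in zip(rows, labels) if r[f] <= t]
            right = [l for r, l in zip(rows, labels) if r[f] > t]
            lo = Counter(left).most_common(1)[0][0]
            hi = Counter(right).most_common(1)[0][0] if right else majority
            correct = left.count(lo) + right.count(hi)
            if correct > best[0]:
                best = (correct, (f, t, lo, hi))
    f, t, lo, hi = best[1]
    return lambda row: lo if row[f] <= t else hi


def oob_error(rows, labels, n_trees, n_features, seed=0):
    """Fraction of records misclassified by the trees that never saw them."""
    rng = random.Random(seed)
    n = len(rows)
    votes = [Counter() for _ in range(n)]
    for _ in range(n_trees):
        drawn = [rng.randrange(n) for _ in range(n)]  # bootstrap indices
        feats = rng.sample(range(len(rows[0])), n_features)
        tree = train_stump([rows[i] for i in drawn],
                           [labels[i] for i in drawn], feats)
        for i in set(range(n)) - set(drawn):  # out-of-bag records only
            votes[i][tree(rows[i])] += 1
    scored = [(v.most_common(1)[0][0], labels[i])
              for i, v in enumerate(votes) if v]
    if not scored:
        raise ValueError("no out-of-bag records; use more trees or data")
    return sum(pred != true for pred, true in scored) / len(scored)


# Synthetic data (made up): the label depends only on feature 0.
rows = [(x, (x * 7) % 5, (x * 3) % 4) for x in range(20)]
labels = ["poor" if x >= 10 else "good" for x in range(20)]

# Keys are (number of features, number of trees), as on the graph axes.
grid = {(m, q): oob_error(rows, labels, n_trees=q, n_features=m)
        for m in (1, 2, 3) for q in (5, 25)}
best_pair = min(grid, key=lambda k: (grid[k], k))
```

Sorting on `(error, pair)` is one way to express the rule of taking the smaller coordinate when several pairs give similarly small errors.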

5. Conclusions

In this paper, our model is intended to predict the presence of pollution in the air based on the AQI level. We will determine the air quality index at different places in New Delhi. We will use many different types of data for measuring the AQI, such as meteorology data and traffic and road data provided by applications like Google Maps.

Thus, the result is yet to be determined.

6. Future Scope

The future scope of the random forest algorithm is considerable, because combining many trees lets it address diverse problems with good accuracy. Online, continuous and endless stream processing challenges the machine learning community to determine which data are important and accurate, so improving the accuracy and performance of random forest on stream data is an important field of research. Random forest with a semi-supervised learning approach is another open field: it makes it possible to construct a classifier from a combination of labeled and unlabeled data, which is useful for both online and offline problems. Almost all classifiers have difficulty with imbalanced data, as they tend to ignore the minority class. Many real-life problems deal with imbalanced data, such as disease diagnosis and fraud detection, so classifiers for imbalanced data are in demand. Random forest, with some modification, can be applied to imbalanced data set problems.

7. Conflicts of Interest: We declare no conflicts of interest.

8. REFERENCES

[1] Sahil Singh, Ayush Yadav, Akhilesh Kumar (June 18, 2021). Prediction of Air Pollution using Random Forest. [Pollution Analysis]. 3rd IEEE International Conference on Advances in Computing, Communication Control and Networking 2021, Greater Noida, India.

[2] Breiman, L. (1996b). Out-of-bag estimation

[3] Nguyen Quoc Khanh Le. Fertility-GRU: Identifying Fertility-Related Proteins by Incorporating Deep-Gated Recurrent Units and Original Position-Specific Scoring Matrix Profiles. Journal of Proteome Research 2019, 18 (9) , 3503-3511.

[4] Yuanqing Mao, Hongliang Yang, Ye Sheng, Jiping Wang, Runhai Ouyang, Caichao Ye, Jiong Yang, Wenqing Zhang. Prediction and Classification of Formation Energies of Binary Compounds by Machine Learning: An Approach without Crystal Structure Information. ACS Omega 2021, 6 (22) , 14533-14541.

[5] Daniel J. Fowles, David S. Palmer, Rui Guo, Sarah L. Price, John B. O. Mitchell. Toward Physics-Based Solubility Computation for Pharmaceuticals to Rival Informatics. Journal of Chemical Theory and Computation 2021, 17 (6) , 3700-3709.

[6] Daiguo Deng, Xiaowei Chen, Ruochi Zhang, Zengrong Lei, Xiaojian Wang, Fengfeng Zhou. XGraphBoost: Extracting Graph Neural Network-Based Features for a Better Prediction of Molecular Properties. Journal of Chemical Information and Modeling 2021, Article ASAP.

[7] Ziduo Yang, Weihe Zhong, Lu Zhao, Calvin Yu-Chian Chen. ML-DTI: Mutual Learning Mechanism for Interpretable Drug–Target Interaction Prediction. The Journal of Physical Chemistry Letters 2021, 12 (17), 4247-4261.

[8] Guohong Liu, Xiliang Yan, Shenqing Wang, Qianhui Yu, Jianbo Jia, Bing Yan. Elucidation of the Critical Role of Core Materials in PM2.5-Induced Cytotoxicity by Interrogating Silica- and Carbon-Based Model PM2.5 Particle Libraries. Environmental Science & Technology 2021, 55 (9), 6128-6139.

[9] Jian Jiang, Rui Wang, Guo-Wei Wei. GGL-Tox: Geometric Graph Learning for Toxicity Prediction. Journal of Chemical Information and Modeling 2021, 61 (4) , 1691-1700.

[10] Predicting Reaction Yields via Supervised Learning. Accounts of Chemical Research 2021, 54 (8), 1856-1865.

[11] Guokui Zheng, Yanle Li, Xu Qian, Ge Yao, Ziqi Tian, Xingwang Zhang, Liang Chen. High-Throughput Screening of a Single-Atom Alloy for Electroreduction of Dinitrogen to Ammonia. ACS Applied Materials & Interfaces 2021, 13 (14), 16336-16344.

[12] Yohei Kosugi, Natalie Hosea Kelvin Cooper, Christopher Baddeley, Bernie French, Katherine Gibson, James Golden, Thiam Lee, Sadrach Pierre, Brent Weiss, Jason Yang. Novel Development of Predictive Feature Fingerprints to Identify Chemistry-Based Features for the Effective Drug Design of SARS-CoV-2 Target Antagonists and Inhibitors Using Machine Learning. ACS Omega 2021, 6 (7), 4857-4877.

[13] Benjamin P. Brown, Jeffrey Mendenhall, Alexander R. Geanes, Jens Meiler. Sankalp Jain, Vishal B. Siramshetty, Vinicius M. Alves, Eugene N. Muratov, Nicole Kleinstreuer, Alexander Tropsha, Marc C. Nicklaus, Anton Simeonov, Alexey V. Zakharov.

[14] Marcus W. H. Wang, Jonathan M. Goodman, Timothy E. H. Allen. Machine Learning in Predictive Toxicology: Recent Applications and Future Directions for Classification Models. Chemical Research in Toxicology 2021, 34 (2) , 217-239.

[15] Breiman, L. (1998b). Randomizing outputs to increase prediction accuracy.

[16] Ting Li, Weida Tong, Ruth Roberts, Zhichao Liu, Shraddha Thakkar. DeepDILI: Deep Learning-Powered Drug-Induced Liver Injury Prediction Using Model-Level Representation. Chemical Research in Toxicology 2021, 34 (2), 550-565.

[17] Vishal B. Siramshetty, Dac-Trung Nguyen, Natalia J. Martinez, Noel T. Southall, Anton Simeonov, Alexey V. Zakharov.

[18] Daniel A. Vallero, in Air Pollution Calculations, 2019.

[19] Anthony DiFranzo, Robert P. Sheridan, Andy Liaw, Matthew Tudor. Nearest Neighbor Gaussian Process for Quantitative Structure–Activity Relationships. Journal of Chemical Information and Modeling 2020, 60 (10) , 4653-4663.

[20] Kangway V. Chuang, Laura M. Gunsalus, Michael J. Keiser. Learning Molecular Representations for Medicinal Chemistry. Journal of Medicinal Chemistry 2020, 63 (16), 8705-8722.

[21] Adrian Stecula, Muhammad S. Hussain, Ronald E. Viola. Discovery of Novel Inhibitors of a Critical Brain Enzyme Using a Homology Model and a Deep Convolutional Neural Network. Journal of Medicinal Chemistry 2020, 63 (16), 8867-8875.

[22] Evan N. Feinberg, Elizabeth Joshi, Vijay S. Pande, Alan C. Cheng. Improvement in ADMET Prediction with Multitask Deep Featurization. Journal of Medicinal Chemistry 2020, 63 (16), 8835-8848. PM2.5.in. [(accessed on 26 May 2021)].

[23] Yohei Kosugi, Natalie Hosea. Direct Comparison of Total Clearance Prediction: Computational Machine Learning Model versus Bottom-Up Approach Using In Vitro Assay. Molecular Pharmaceutics 2020, 17 (7), 2299-2309.

[24] Yuting Xu, Deeptak Verma, Robert P. Sheridan, Andy Liaw, Junshui Ma, Nicholas M. Marshall, John McIntosh, Edward C. Sherer, Vladimir Svetnik, Jennifer M. Johnston. Deep Dive into Machine Learning Models for Protein Engineering. Journal of Chemical Information and Modeling 2020, 60 (6), 2773-2790.

[25] Yuan J., Zheng Y., Xie X. Discovering regions of different functions in a city using human mobility and POIs; Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; Beijing, China. 12–16 August 2012; pp. 186–194.

[26] RP5.ru: Weather for 243 Countries of the World. [(accessed on 25 January 2020)].

[27] Baidu Map. [(accessed on 28 May 2021)].

[28] Google Map. [(accessed on 28 May 2021)].

[29] Xin Yang, Yifei Wang, Ryan Byrne, Gisbert Schneider, Shengyong Yang. Concepts of Artificial Intelligence for Computer-Assisted Drug Discovery. Chemical Reviews 2019, 119 (18) , 10520-10594.