27. Reading and Writing Data in Pandas
By Bernd Klein. Last modified: 01 February 2022.
All the powerful data structures like the Series and the DataFrames would be of little use if the Pandas module didn't provide powerful functionality for reading in and writing out data. It is not only a matter of having functions for interacting with files. To be useful to data scientists, it also needs functions which support the most important data formats, like:
- Delimiter-separated files, like e.g. csv
- Microsoft Excel files
- HTML
- XML
- JSON
Delimiter-separated Values
Most people take csv files as a synonym for delimiter-separated values files. They leave out of account the fact that csv is an acronym for "comma separated values", which is not the case in many situations. Pandas also uses "csv" in contexts in which "dsv" would be more appropriate.
Delimiter-separated values (DSV) define and store two-dimensional arrays (for example of strings) of data by separating the values in each row with delimiter characters defined for this purpose. This way of storing data is often used in combination with spreadsheet programs, which can read in and write out data as DSV. DSV files are also used as a general data exchange format.
We call a text file a "delimited text file" if it contains text in DSV format.
For example, the file dollar_euro.txt is a delimited text file and uses tabs (\t) as delimiters.
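To make the format concrete, here is a small sketch (not part of the original data set) that writes a few tab-separated lines to a hypothetical file small_rates.txt and reads them back; the file name and the two data rows are purely illustrative:

# Illustrative only: create and inspect a tiny tab-delimited text file.
rows = [("Year", "Average"), ("2016", "0.901696"), ("2015", "0.901896")]

with open("small_rates.txt", "w") as fh:      # hypothetical file name
    for row in rows:
        fh.write("\t".join(row) + "\n")       # the tab is the delimiter

with open("small_rates.txt") as fh:
    print(fh.read())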
Reading CSV and DSV Files
To be precise, Pandas offers two ways to read in CSV or DSV files:
- DataFrame.from_csv
- read_csv
There is no big difference between those two functions, e.g. they have different default values in some cases and read_csv has more parameters. We will focus on read_csv, because DataFrame.from_csv is kept inside Pandas only for reasons of backwards compatibility.
import pandas as pd

exchange_rates = pd.read_csv("/data1/dollar_euro.txt", sep="\t")
print(exchange_rates)
OUTPUT:
    Year   Average  Min USD/EUR  Max USD/EUR  Working days
0   2016  0.901696     0.864379     0.959785           247
1   2015  0.901896     0.830358     0.947688           256
2   2014  0.753941     0.716692     0.823655           255
3   2013  0.753234     0.723903     0.783208           255
4   2012  0.778848     0.743273     0.827198           256
5   2011  0.719219     0.671953     0.775855           257
6   2010  0.755883     0.686672     0.837381           258
7   2009  0.718968     0.661376     0.796495           256
8   2008  0.683499     0.625391     0.802568           256
9   2007  0.730754     0.672314     0.775615           255
10  2006  0.797153     0.750131     0.845594           255
11  2005  0.805097     0.740357     0.857118           257
12  2004  0.804828     0.733514     0.847314           259
13  2003  0.885766     0.791766     0.963670           255
14  2002  1.060945     0.953562     1.165773           255
15  2001  1.117587     1.047669     1.192748           255
16  2000  1.085899     0.962649     1.211827           255
17  1999  0.939475     0.848176     0.998502           261
As we can see, read_csv automatically used the first line as the names for the columns. It is possible to give other names to the columns. For this purpose, we have to skip the first line by setting the parameter "header" to 0 and we have to assign a list with the column names to the parameter "names":
import pandas as pd

exchange_rates = pd.read_csv("/data1/dollar_euro.txt", sep="\t",
                             header=0,
                             names=["year", "min", "max", "days"])
print(exchange_rates)
OUTPUT:
          year       min       max  days
2016  0.901696  0.864379  0.959785   247
2015  0.901896  0.830358  0.947688   256
2014  0.753941  0.716692  0.823655   255
2013  0.753234  0.723903  0.783208   255
2012  0.778848  0.743273  0.827198   256
2011  0.719219  0.671953  0.775855   257
2010  0.755883  0.686672  0.837381   258
2009  0.718968  0.661376  0.796495   256
2008  0.683499  0.625391  0.802568   256
2007  0.730754  0.672314  0.775615   255
2006  0.797153  0.750131  0.845594   255
2005  0.805097  0.740357  0.857118   257
2004  0.804828  0.733514  0.847314   259
2003  0.885766  0.791766  0.963670   255
2002  1.060945  0.953562  1.165773   255
2001  1.117587  1.047669  1.192748   255
2000  1.085899  0.962649  1.211827   255
1999  0.939475  0.848176  0.998502   261
Exercise 1
The file "countries_population.csv" is a csv file, containing the population numbers of all countries (July 2014). The delimiter of the file is a space and commas are used to split groups of thousands in the numbers. The method 'caput(n)' of a DataFrame can be used to give out only the showtime n rows or lines. Read the file into a DataFrame.
Solution:
pop = pd.read_csv("/data1/countries_population.csv",
                  header=None,
                  names=["Country", "Population"],
                  index_col=0,
                  quotechar="'",
                  sep=" ",
                  thousands=",")
print(pop.head(5))
OUTPUT:
                Population
Country
China           1355692576
India           1236344631
European Union   511434812
United States    318892103
Indonesia        253609643
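The parameters quotechar and thousands do the real work in this solution: country names such as 'United States' are enclosed in single quotes, and the population numbers contain commas as thousands separators. The following sketch, which is not from the original article and only uses an inline string via io.StringIO, shows the same parsing without needing the data file:

from io import StringIO
import pandas as pd

# Two sample lines in the same format as countries_population.csv.
data = "'United States' 318,892,103\n'Indonesia' 253,609,643\n"

sample = pd.read_csv(StringIO(data),
                     header=None,
                     names=["Country", "Population"],
                     index_col=0,
                     quotechar="'",
                     sep=" ",
                     thousands=",")
print(sample)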
Writing csv Files
We can create csv (or dsv) files with the method "to_csv". Before we do this, we will prepare some data to output, which we will write to a file. We have two csv files with population data for various countries. countries_male_population.csv contains the figures of the male populations and countries_female_population.csv correspondingly the numbers for the female populations. We will create a new csv file with the sum:
column_names = ["Country"] + list(range(2002, 2013))

male_pop = pd.read_csv("/data1/countries_male_population.csv",
                       header=None,
                       index_col=0,
                       names=column_names)
female_pop = pd.read_csv("/data1/countries_female_population.csv",
                         header=None,
                         index_col=0,
                         names=column_names)

population = male_pop + female_pop
population
Country | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 |
---|---|---|---|---|---|---|---|---|---|---|---|
Australia | 19640979.0 | 19872646 | 20091504 | 20339759 | 20605488 | 21015042 | 21431781 | 21874920 | 22342398 | 22620554 | 22683573 |
Austria | 8139310.0 | 8067289 | 8140122 | 8206524 | 8265925 | 8298923 | 8331930 | 8355260 | 8375290 | 8404252 | 8443018 |
Belgium | 10309725.0 | 10355844 | 10396421 | 10445852 | 10511382 | 10584534 | 10666866 | 10753080 | 10839905 | 10366843 | 11035958 |
Canada | NaN | 31361611 | 31372587 | 31989454 | 32299496 | 32649482 | 32927372 | 33327337 | 33334414 | 33927935 | 34492645 |
Czech Republic | 10269726.0 | 10203269 | 10211455 | 10220577 | 10251079 | 10287189 | 10381130 | 10467542 | 10506813 | 10532770 | 10505445 |
Denmark | 5368354.0 | 5383507 | 5397640 | 5411405 | 5427459 | 5447084 | 5475791 | 5511451 | 5534738 | 5560628 | 5580516 |
Finland | 5194901.0 | 5206295 | 5219732 | 5236611 | 5255580 | 5276955 | 5300484 | 5326314 | 5351427 | 5375276 | 5401267 |
France | 59337731.0 | 59630121 | 59900680 | 62518571 | 62998773 | 63392140 | 63753140 | 64366962 | 64716310 | 65129746 | 65394283 |
Germany | 82440309.0 | 82536680 | 82531671 | 82500849 | 82437995 | 82314906 | 82217837 | 82002356 | 81802257 | 81751602 | 81843743 |
Greece | 10988000.0 | 11006377 | 11040650 | 11082751 | 11125179 | 11171740 | 11213785 | 11260402 | 11305118 | 11309885 | 11290067 |
Hungary | 10174853.0 | 10142362 | 10116742 | 10097549 | 10076581 | 10066158 | 10045401 | 10030975 | 10014324 | 9985722 | 9957731 |
Iceland | 286575.0 | 288471 | 290570 | 293577 | 299891 | 307672 | 315459 | 319368 | 317630 | 318452 | 319575 |
Ireland | 3882683.0 | 3963636 | 4027732 | 4109173 | 4209019 | 4239848 | 4401335 | 4450030 | 4467854 | 4569864 | 4582769 |
Italy | 56993742.0 | 57321070 | 57888245 | 58462375 | 58751711 | 59131287 | 59619290 | 60045068 | 60340328 | 60626442 | 60820696 |
Japan | 127291000.0 | 127435000 | 127620000 | 127687000 | 127767994 | 127770000 | 127771000 | 127692000 | 127510000 | 128057000 | 127799000 |
Korea | 47639618.0 | 47925318 | 48082163 | 48138077 | 48297184 | 48456369 | 48606787 | 48746693 | 48874539 | 49779440 | 50004441 |
Luxembourg | 444050.0 | 448300 | 451600 | 455000 | 469086 | 476187 | 483799 | 493500 | 502066 | 511840 | 524853 |
Mexico | 101826249.0 | 103039964 | 104213503 | 103001871 | 103946866 | 104874282 | 105790725 | 106682518 | 107550697 | 108396211 | 115682867 |
Netherlands | 16105285.0 | 16192572 | 16258032 | 16305526 | 16334210 | 16357992 | 16405399 | 16485787 | 16574989 | 16655799 | 16730348 |
New Zealand | 3939130.0 | 4009200 | 4062500 | 4100570 | 4139470 | 4228280 | 4268880 | 4315840 | 4367740 | 4405150 | 4433100 |
Norway | 4524066.0 | 4552252 | 4577457 | 4606363 | 4640219 | 4681134 | 4737171 | 4799252 | 4858199 | 4920305 | 4985870 |
Poland | 38632453.0 | 38218531 | 38190608 | 38173835 | 38157055 | 38125479 | 38115641 | 38135876 | 38167329 | 38200037 | 38538447 |
Portugal | 10335559.0 | 10407465 | 10474685 | 10529255 | 10569592 | 10599095 | 10617575 | 10627250 | 10637713 | 10636979 | 10542398 |
Slovak Republic | 5378951.0 | 5379161 | 5380053 | 5384822 | 5389180 | 5393637 | 5400998 | 5412254 | 5424925 | 5435273 | 5404322 |
Spain | 40409330.0 | 41550584 | 42345342 | 43038035 | 43758250 | 44474631 | 45283259 | 45828172 | 45989016 | 46152926 | 46818221 |
Sweden | 8909128.0 | 8940788 | 8975670 | 9011392 | 9047752 | 9113257 | 9182927 | 9256347 | 9340682 | 9415570 | 9482855 |
Switzerland | 7261210.0 | 7313853 | 7364148 | 7415102 | 7459128 | 7508739 | 7593494 | 7701856 | 7785806 | 7870134 | 7954662 |
Turkey | NaN | 70171979 | 70689500 | 71607500 | 72519974 | 72519974 | 70586256 | 71517100 | 72561312 | 73722988 | 74724269 |
United Kingdom | 58706905.0 | 59262057 | 59699828 | 60059858 | 60412870 | 60781346 | 61179260 | 61595094 | 62026962 | 62498612 | 63256154 |
United States | 277244916.0 | 288774226 | 290810719 | 294442683 | 297308143 | 300184434 | 304846731 | 305127551 | 307756577 | 309989078 | 312232049 |
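Note that the + operator returns NaN wherever a value is missing in one of the two DataFrames, which is why Canada and Turkey show NaN for 2002. If missing figures should be treated as 0 instead, the add method with a fill_value argument can be used; the following line is only a sketch of that alternative, not what the text above does:

# Alternative: treat a missing male or female figure as 0 instead of producing NaN.
population_filled = male_pop.add(female_pop, fill_value=0)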
population.to_csv("/data1/countries_total_population.csv")
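By default, to_csv writes a comma-separated file and includes the index (the country names). Other delimiters and number formats can be chosen via parameters; the following sketch writes a tab-separated variant under an assumed file name:

# Tab-separated output, keeping the country index and rounding the floats.
population.to_csv("/data1/countries_total_population.tsv",   # assumed file name
                  sep="\t",
                  float_format="%.0f")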
We want to create a new DataFrame with all the data, i.e. female, male and complete population. This means that we have to introduce a hierarchical index. Before we do it on our DataFrame, we will introduce this problem in a simple example:
import pandas as pd

shop1 = {"foo": {2010: 23, 2011: 25}, "bar": {2010: 13, 2011: 29}}
shop2 = {"foo": {2010: 223, 2011: 225}, "bar": {2010: 213, 2011: 229}}

shop1 = pd.DataFrame(shop1)
shop2 = pd.DataFrame(shop2)

both_shops = shop1 + shop2
print("Sales of shop1:\n", shop1)
print("\nSales of both shops\n", both_shops)
OUTPUT:
Sales of shop1:
       foo  bar
2010   23   13
2011   25   29

 Sales of both shops
       foo  bar
2010  246  226
2011  250  258
shops = pd.concat([shop1, shop2], keys=["one", "two"])
shops
shop | year | foo | bar |
---|---|---|---|
one | 2010 | 23 | 13 |
one | 2011 | 25 | 29 |
two | 2010 | 223 | 213 |
two | 2011 | 225 | 229 |
We want to swap the hierarchical indices. For this we will use 'swaplevel':
shops = shops.swaplevel()
shops.sort_index(inplace=True)
shops
year | shop | foo | bar |
---|---|---|---|
2010 | one | 23 | 13 |
2010 | two | 223 | 213 |
2011 | one | 25 | 29 |
2011 | two | 225 | 229 |
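With the swapped and sorted index the outer level is now the year, so a whole year or a single cell can be selected directly. A short sketch of such selections (not part of the original text):

# Select both shops for 2010 (outer index level) ...
print(shops.loc[2010])
# ... and a single value via a (year, shop) tuple.
print(shops.loc[(2010, "one"), "foo"])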
We will go back to our initial problem with the population figures. We will apply the same steps to those DataFrames:
pop_complete = pd.concat([population.T, male_pop.T, female_pop.T],
                         keys=["total", "male", "female"])

df = pop_complete.swaplevel()
df.sort_index(inplace=True)
df[["Austria", "Australia", "France"]]
year | | Austria | Australia | France |
---|---|---|---|---|
2002 | female | 4179743.0 | 9887846.0 | 30510073.0 |
2002 | male | 3959567.0 | 9753133.0 | 28827658.0 |
2002 | total | 8139310.0 | 19640979.0 | 59337731.0 |
2003 | female | 4158169.0 | 9999199.0 | 30655533.0 |
2003 | male | 3909120.0 | 9873447.0 | 28974588.0 |
2003 | total | 8067289.0 | 19872646.0 | 59630121.0 |
2004 | female | 4190297.0 | 10100991.0 | 30789154.0 |
2004 | male | 3949825.0 | 9990513.0 | 29111526.0 |
2004 | total | 8140122.0 | 20091504.0 | 59900680.0 |
2005 | female | 4220228.0 | 10218321.0 | 32147490.0 |
2005 | male | 3986296.0 | 10121438.0 | 30371081.0 |
2005 | total | 8206524.0 | 20339759.0 | 62518571.0 |
2006 | female | 4246571.0 | 10348070.0 | 32390087.0 |
2006 | male | 4019354.0 | 10257418.0 | 30608686.0 |
2006 | total | 8265925.0 | 20605488.0 | 62998773.0 |
2007 | female | 4261752.0 | 10570420.0 | 32587979.0 |
2007 | male | 4037171.0 | 10444622.0 | 30804161.0 |
2007 | total | 8298923.0 | 21015042.0 | 63392140.0 |
2008 | female | 4277716.0 | 10770864.0 | 32770860.0 |
2008 | male | 4054214.0 | 10660917.0 | 30982280.0 |
2008 | total | 8331930.0 | 21431781.0 | 63753140.0 |
2009 | female | 4287213.0 | 10986535.0 | 33208315.0 |
2009 | male | 4068047.0 | 10888385.0 | 31158647.0 |
2009 | total | 8355260.0 | 21874920.0 | 64366962.0 |
2010 | female | 4296197.0 | 11218144.0 | 33384930.0 |
2010 | male | 4079093.0 | 11124254.0 | 31331380.0 |
2010 | total | 8375290.0 | 22342398.0 | 64716310.0 |
2011 | female | 4308915.0 | 11359807.0 | 33598633.0 |
2011 | male | 4095337.0 | 11260747.0 | 31531113.0 |
2011 | total | 8404252.0 | 22620554.0 | 65129746.0 |
2012 | female | 4324983.0 | 11402769.0 | 33723892.0 |
2012 | male | 4118035.0 | 11280804.0 | 31670391.0 |
2012 | total | 8443018.0 | 22683573.0 | 65394283.0 |
df.to_csv("/data1/countries_total_population.csv")
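The file written here contains two index columns, the year and the 'female'/'male'/'total' label. When reading it back, read_csv can rebuild the hierarchical index from those columns; a sketch under the assumption that the file was written exactly as above:

# Rebuild the MultiIndex from the first two columns of the csv file.
df_restored = pd.read_csv("/data1/countries_total_population.csv",
                          index_col=[0, 1])
print(df_restored[["Austria", "Australia", "France"]].head(6))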
Exercise 2
- Read in the dsv file (csv) bundeslaender.txt. Create a new file with the columns 'land', 'area', 'female', 'male', 'population' and 'density' (inhabitants per square kilometre).
- Print out the rows where the area is greater than 30000 and the population is greater than 10000.
- Print the rows where the density is greater than 300.
lands = pd.read_csv('/data1/bundeslaender.txt', sep=" ")
print(lands.columns.values)
OUTPUT:
['land' 'area' 'male' 'female']
# swap the columns of our DataFrame:
lands = lands.reindex(columns=['land', 'area', 'female', 'male'])
lands[:2]
| | land | area | female | male |
---|---|---|---|---|
0 | Baden-Württemberg | 35751.65 | 5465 | 5271 |
1 | Bayern | 70551.57 | 6366 | 6103 |
lands.insert(loc=len(lands.columns),
             column='population',
             value=lands['female'] + lands['male'])
lands[:3]
| | land | area | female | male | population |
---|---|---|---|---|---|
0 | Baden-Württemberg | 35751.65 | 5465 | 5271 | 10736 |
1 | Bayern | 70551.57 | 6366 | 6103 | 12469 |
2 | Berlin | 891.85 | 1736 | 1660 | 3396 |
lands.insert(loc=len(lands.columns),
             column='density',
             value=(lands['population'] * 1000 / lands['area']).round(0))
lands[:4]
| | land | area | female | male | population | density |
---|---|---|---|---|---|---|
0 | Baden-Württemberg | 35751.65 | 5465 | 5271 | 10736 | 300.0 |
1 | Bayern | 70551.57 | 6366 | 6103 | 12469 | 177.0 |
2 | Berlin | 891.85 | 1736 | 1660 | 3396 | 3808.0 |
3 | Brandenburg | 29478.61 | 1293 | 1267 | 2560 | 87.0 |
print(lands.loc[(lands.area > 30000) & (lands.population > 10000)])
OUTPUT:
                  land      area  female  male  population  density
0    Baden-Württemberg  35751.65    5465  5271       10736    300.0
1               Bayern  70551.57    6366  6103       12469    177.0
9  Nordrhein-Westfalen  34085.29    9261  8797       18058    530.0
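The exercise also asks for a new file with the extended columns and for the rows with a density above 300. As a sketch (the output file name is only an assumption), both remaining steps can be finished like this:

# Write the extended DataFrame to a new dsv file (assumed file name).
lands.to_csv("/data1/bundeslaender_density.txt", sep=" ", index=False)

# Rows where the density is greater than 300.
print(lands.loc[lands.density > 300])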
Reading and Writing Excel Files
It is also possible to read and write Microsoft Excel files. The Pandas functionalities to read and write Excel files use the modules 'xlrd' and 'openpyxl'. These modules are not automatically installed by Pandas, so you may have to install them manually!
We will use a simple Excel document to demonstrate the reading capabilities of Pandas. The document sales.xls contains two sheets, one called 'week1' and the other one 'week2'.
An Excel file can be read in with the Pandas function "read_excel". This is demonstrated in the following example Python code:
excel_file = pd.ExcelFile("/data1/sales.xls")
sheet = pd.read_excel(excel_file)
sheet
| | Weekday | Sales |
---|---|---|
0 | Monday | 123432.980000 |
1 | Tuesday | 122198.650200 |
2 | Wednesday | 134418.515220 |
3 | Thursday | 131730.144916 |
4 | Friday | 128173.431003 |
The document "sales.xls" contains 2 sheets, but we just have been able to read in the get-go one with "read_excel". A complete Excel document, which tin consist of an arbitrary number of sheets, can exist completely read in similar this:
docu = {}
for sheet_name in excel_file.sheet_names:
    docu[sheet_name] = excel_file.parse(sheet_name)

for sheet_name in docu:
    print("\n" + sheet_name + ":\n", docu[sheet_name])
OUTPUT:
week1:
      Weekday          Sales
0      Monday  123432.980000
1     Tuesday  122198.650200
2   Wednesday  134418.515220
3    Thursday  131730.144916
4      Friday  128173.431003

week2:
      Weekday          Sales
0      Monday  223277.980000
1     Tuesday  234441.879000
2   Wednesday  246163.972950
3    Thursday  241240.693491
4      Friday  230143.621590
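As an alternative to looping over excel_file.sheet_names, read_excel can read every sheet in one call: passing sheet_name=None returns a dictionary that maps sheet names to DataFrames. A short sketch of this variant:

# Read all sheets at once; the result is a dict of DataFrames.
all_sheets = pd.read_excel("/data1/sales.xls", sheet_name=None)
for name, frame in all_sheets.items():
    print("\n" + name + ":\n", frame)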
We will now calculate the average sales numbers of the two weeks:
average = docu["week1"].copy()
average["Sales"] = (docu["week1"]["Sales"] + docu["week2"]["Sales"]) / 2
print(average)
OUTPUT:
     Weekday          Sales
0     Monday  173355.480000
1    Tuesday  178320.264600
2  Wednesday  190291.244085
3   Thursday  186485.419203
4     Friday  179158.526297
We will save the DataFrame 'average' in a new document with 'week1' and 'week2' as additional sheets as well:
writer = pd.ExcelWriter('/data1/sales_average.xlsx')
docu['week1'].to_excel(writer, sheet_name='week1')
docu['week2'].to_excel(writer, sheet_name='week2')
average.to_excel(writer, sheet_name='average')
writer.save()
writer.close()
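In newer pandas versions writer.save() is deprecated; the usual pattern is to use ExcelWriter as a context manager, which saves and closes the file automatically. The same export could then be written as the following sketch:

# Equivalent export with a context manager; no explicit save()/close() needed.
with pd.ExcelWriter('/data1/sales_average.xlsx') as writer:
    docu['week1'].to_excel(writer, sheet_name='week1')
    docu['week2'].to_excel(writer, sheet_name='week2')
    average.to_excel(writer, sheet_name='average')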
Source: https://python-course.eu/numerical-programming/reading-and-writing-data-in-pandas.php