Read Text Data in Python as DataFrame

27. Reading and Writing Data in Pandas

By Bernd Klein. Last modified: 01 February 2022.

All the powerful data structures like the Series and the DataFrames would be of no avail if the Pandas module did not provide powerful functionalities for reading in and writing out data. It is not only a matter of having functions for interacting with files. To be useful to data scientists it also needs functions which support the most important data formats like

  • Delimiter-separated files, like e.g. csv
  • Microsoft Excel files
  • HTML
  • XML
  • JSON

Delimiter-separated Values

Digits as File Input and Output

Most people take csv files as a synonym for delimiter-separated values files. They disregard the fact that csv is an acronym for "comma separated values", and in many situations the values are not separated by commas. Pandas also uses "csv" in contexts in which "dsv" would be more appropriate.

Delimiter-separated values (DSV) files store two-dimensional arrays of data (for example strings) by separating the values in each row with delimiter characters defined for this purpose. This way of storing data is often used in combination with spreadsheet programs, which can read in and write out data as DSV. DSV files are also used as a general data exchange format.

We call a text file a "delimited text file" if it contains text in DSV format.

For example, the file dollar_euro.txt is a delimited text file and uses tabs (\t) as delimiters.
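To make the idea of a delimiter concrete, here is a minimal sketch that uses Python's standard csv module rather than Pandas. The sample text is an assumed excerpt in the spirit of dollar_euro.txt, not the real content of the file:

```python
import csv
import io

# Assumed tab-separated sample; the real dollar_euro.txt may look different.
dsv_text = (
    "Year\tAverage\tWorking days\n"
    "2016\t0.901696\t247\n"
    "2015\t0.901896\t256\n"
)

# csv.reader splits every line at the delimiter character we pass in.
rows = list(csv.reader(io.StringIO(dsv_text), delimiter="\t"))

header, records = rows[0], rows[1:]
print(header)
for record in records:
    print(record)
```

Note that the csv module returns every value as a string; converting the numbers is left to the caller. This is one of the chores that read_csv takes over.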

Reading CSV and DSV Files

Pandas offers two ways to read in CSV files, or DSV files to be precise:

  • DataFrame.from_csv
  • read_csv

There is no big difference between those two functions: they have different default values in some cases, and read_csv has more parameters. We will focus on read_csv, because DataFrame.from_csv was kept inside Pandas only for reasons of backwards compatibility and has been removed in Pandas 1.0.

    import pandas as pd

    exchange_rates = pd.read_csv("/data1/dollar_euro.txt", sep="\t")
    print(exchange_rates)

OUTPUT:

        Year   Average  Min USD/EUR  Max USD/EUR  Working days
    0   2016  0.901696     0.864379     0.959785           247
    1   2015  0.901896     0.830358     0.947688           256
    2   2014  0.753941     0.716692     0.823655           255
    3   2013  0.753234     0.723903     0.783208           255
    4   2012  0.778848     0.743273     0.827198           256
    5   2011  0.719219     0.671953     0.775855           257
    6   2010  0.755883     0.686672     0.837381           258
    7   2009  0.718968     0.661376     0.796495           256
    8   2008  0.683499     0.625391     0.802568           256
    9   2007  0.730754     0.672314     0.775615           255
    10  2006  0.797153     0.750131     0.845594           255
    11  2005  0.805097     0.740357     0.857118           257
    12  2004  0.804828     0.733514     0.847314           259
    13  2003  0.885766     0.791766     0.963670           255
    14  2002  1.060945     0.953562     1.165773           255
    15  2001  1.117587     1.047669     1.192748           255
    16  2000  1.085899     0.962649     1.211827           255
    17  1999  0.939475     0.848176     0.998502           261

As we can see, read_csv automatically used the first line as the names for the columns. It is possible to give other names to the columns. For this purpose, we have to skip the first line by setting the parameter "header" to 0 and we have to assign a list with the column names to the parameter "names":

    import pandas as pd

    exchange_rates = pd.read_csv("/data1/dollar_euro.txt",
                                 sep="\t",
                                 header=0,
                                 names=["year", "min", "max", "days"])
    print(exchange_rates)

OUTPUT:

              year       min       max  days
    2016  0.901696  0.864379  0.959785   247
    2015  0.901896  0.830358  0.947688   256
    2014  0.753941  0.716692  0.823655   255
    2013  0.753234  0.723903  0.783208   255
    2012  0.778848  0.743273  0.827198   256
    2011  0.719219  0.671953  0.775855   257
    2010  0.755883  0.686672  0.837381   258
    2009  0.718968  0.661376  0.796495   256
    2008  0.683499  0.625391  0.802568   256
    2007  0.730754  0.672314  0.775615   255
    2006  0.797153  0.750131  0.845594   255
    2005  0.805097  0.740357  0.857118   257
    2004  0.804828  0.733514  0.847314   259
    2003  0.885766  0.791766  0.963670   255
    2002  1.060945  0.953562  1.165773   255
    2001  1.117587  1.047669  1.192748   255
    2000  1.085899  0.962649  1.211827   255
    1999  0.939475  0.848176  0.998502   261
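In the call above, four names were passed for a file with five columns. In that case Pandas uses the first column (the year) as the index, and the label "year" ends up above the average values. The following sketch contrasts both situations on an inline stand-in for the file; the two data rows are copied from the output above, and the name "average" is chosen for this illustration:

```python
import io
import pandas as pd

# Inline stand-in for dollar_euro.txt with two of its rows.
data = (
    "Year\tAverage\tMin USD/EUR\tMax USD/EUR\tWorking days\n"
    "2016\t0.901696\t0.864379\t0.959785\t247\n"
    "2015\t0.901896\t0.830358\t0.947688\t256\n"
)

# Five names for five columns: every column is renamed, the index is 0, 1, ...
five = pd.read_csv(io.StringIO(data), sep="\t", header=0,
                   names=["year", "average", "min", "max", "days"])

# Four names for five columns: the first column becomes the index.
four = pd.read_csv(io.StringIO(data), sep="\t", header=0,
                   names=["average", "min", "max", "days"])

print(five.columns.tolist())
print(four.index.tolist())
```

Passing one name per column is usually the less surprising choice; the index can still be set explicitly with the parameter index_col.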

Exercise 1

The file "countries_population.csv" is a csv file, containing the population numbers of all countries (July 2014). The delimiter of the file is a space and commas are used to separate groups of thousands in the numbers. The method 'head(n)' of a DataFrame can be used to give out only the first n rows or lines. Read the file into a DataFrame.

Solution:

    pop = pd.read_csv("/data1/countries_population.csv",
                      header=None,
                      names=["Country", "Population"],
                      index_col=0,
                      quotechar="'",
                      sep=" ",
                      thousands=",")
    print(pop.head(5))

OUTPUT:

                    Population
    Country
    China           1355692576
    India           1236344631
    European Union   511434812
    United States    318892103
    Indonesia        253609643
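The interplay of the parameters sep, quotechar and thousands can be tried out without the file. The sketch below uses an assumed excerpt: the layout (space as delimiter, single quotes around names containing a space) is a guess derived from the parameters of the solution, while the figures are taken from the output above:

```python
import io
import pandas as pd

# Assumed layout of countries_population.csv; only the numbers come from the text.
data = (
    "China 1,355,692,576\n"
    "India 1,236,344,631\n"
    "'European Union' 511,434,812\n"
)

pop = pd.read_csv(io.StringIO(data),
                  header=None,
                  names=["Country", "Population"],
                  index_col=0,
                  quotechar="'",   # keeps 'European Union' together as one field
                  sep=" ",
                  thousands=",")   # turns 511,434,812 into the integer 511434812

print(pop.dtypes)
print(pop.loc["European Union", "Population"])
```

Without thousands="," the population column would be read as strings, and arithmetic on it would not work as expected.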

Writing csv Files


We can create csv (or dsv) files with the method "to_csv". Before we do this, we will prepare some data to output, which we will write to a file. We have two csv files with population data for various countries. countries_male_population.csv contains the figures of the male populations and countries_female_population.csv correspondingly the numbers for the female populations. We will create a new csv file with the sum:

    column_names = ["Country"] + list(range(2002, 2013))
    male_pop = pd.read_csv("/data1/countries_male_population.csv",
                           header=None,
                           index_col=0,
                           names=column_names)
    female_pop = pd.read_csv("/data1/countries_female_population.csv",
                             header=None,
                             index_col=0,
                             names=column_names)
    population = male_pop + female_pop
2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
Country
Australia 19640979.0 19872646 20091504 20339759 20605488 21015042 21431781 21874920 22342398 22620554 22683573
Austria 8139310.0 8067289 8140122 8206524 8265925 8298923 8331930 8355260 8375290 8404252 8443018
Belgium 10309725.0 10355844 10396421 10445852 10511382 10584534 10666866 10753080 10839905 10366843 11035958
Canada NaN 31361611 31372587 31989454 32299496 32649482 32927372 33327337 33334414 33927935 34492645
Czech Republic 10269726.0 10203269 10211455 10220577 10251079 10287189 10381130 10467542 10506813 10532770 10505445
Denmark 5368354.0 5383507 5397640 5411405 5427459 5447084 5475791 5511451 5534738 5560628 5580516
Finland 5194901.0 5206295 5219732 5236611 5255580 5276955 5300484 5326314 5351427 5375276 5401267
France 59337731.0 59630121 59900680 62518571 62998773 63392140 63753140 64366962 64716310 65129746 65394283
Germany 82440309.0 82536680 82531671 82500849 82437995 82314906 82217837 82002356 81802257 81751602 81843743
Greece 10988000.0 11006377 11040650 11082751 11125179 11171740 11213785 11260402 11305118 11309885 11290067
Hungary 10174853.0 10142362 10116742 10097549 10076581 10066158 10045401 10030975 10014324 9985722 9957731
Iceland 286575.0 288471 290570 293577 299891 307672 315459 319368 317630 318452 319575
Ireland 3882683.0 3963636 4027732 4109173 4209019 4239848 4401335 4450030 4467854 4569864 4582769
Italy 56993742.0 57321070 57888245 58462375 58751711 59131287 59619290 60045068 60340328 60626442 60820696
Japan 127291000.0 127435000 127620000 127687000 127767994 127770000 127771000 127692000 127510000 128057000 127799000
Korea 47639618.0 47925318 48082163 48138077 48297184 48456369 48606787 48746693 48874539 49779440 50004441
Luxembourg 444050.0 448300 451600 455000 469086 476187 483799 493500 502066 511840 524853
Mexico 101826249.0 103039964 104213503 103001871 103946866 104874282 105790725 106682518 107550697 108396211 115682867
Netherlands 16105285.0 16192572 16258032 16305526 16334210 16357992 16405399 16485787 16574989 16655799 16730348
New Zealand 3939130.0 4009200 4062500 4100570 4139470 4228280 4268880 4315840 4367740 4405150 4433100
Norway 4524066.0 4552252 4577457 4606363 4640219 4681134 4737171 4799252 4858199 4920305 4985870
Poland 38632453.0 38218531 38190608 38173835 38157055 38125479 38115641 38135876 38167329 38200037 38538447
Portugal 10335559.0 10407465 10474685 10529255 10569592 10599095 10617575 10627250 10637713 10636979 10542398
Slovak Republic 5378951.0 5379161 5380053 5384822 5389180 5393637 5400998 5412254 5424925 5435273 5404322
Spain 40409330.0 41550584 42345342 43038035 43758250 44474631 45283259 45828172 45989016 46152926 46818221
Sweden 8909128.0 8940788 8975670 9011392 9047752 9113257 9182927 9256347 9340682 9415570 9482855
Switzerland 7261210.0 7313853 7364148 7415102 7459128 7508739 7593494 7701856 7785806 7870134 7954662
Turkey NaN 70171979 70689500 71607500 72519974 72519974 70586256 71517100 72561312 73722988 74724269
United Kingdom 58706905.0 59262057 59699828 60059858 60412870 60781346 61179260 61595094 62026962 62498612 63256154
United States 277244916.0 288774226 290810719 294442683 297308143 300184434 304846731 305127551 307756577 309989078 312232049

    population.to_csv("/data1/countries_total_population.csv")
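The round trip of adding two tables, writing the sum and reading it back can be sketched without any files. The country names and figures below are invented for this illustration:

```python
import io
import pandas as pd

# Tiny made-up tables standing in for the male and female population files.
countries = pd.Index(["A-land", "B-land"], name="Country")
male = pd.DataFrame({2002: [10, 20], 2003: [11, 21]}, index=countries)
female = pd.DataFrame({2002: [12, 22], 2003: [13, 23]}, index=countries)

# Element-wise sum, aligned on the index and the column labels.
total = male + female

# Without a path argument, to_csv returns the CSV text as a string.
csv_text = total.to_csv()
print(csv_text)

# Reading it back: the column labels arrive as strings ("2002"), not integers.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(restored.columns.tolist())
```

The index name "Country" is written into the header line, which is why index_col=0 is enough to restore the index when reading the data back.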

We want to create a new DataFrame with all the data, i.e. female, male and complete population. This means that we have to introduce a hierarchical index. Before we do it on our DataFrame, we will introduce this problem in a simple example:

    import pandas as pd

    shop1 = {"foo": {2010: 23, 2011: 25}, "bar": {2010: 13, 2011: 29}}
    shop2 = {"foo": {2010: 223, 2011: 225}, "bar": {2010: 213, 2011: 229}}

    shop1 = pd.DataFrame(shop1)
    shop2 = pd.DataFrame(shop2)
    both_shops = shop1 + shop2
    print("Sales of shop1:\n", shop1)
    print("\nSales of both shops\n", both_shops)

OUTPUT:

    Sales of shop1:
           foo  bar
    2010   23   13
    2011   25   29

    Sales of both shops
           foo  bar
    2010  246  226
    2011  250  258

    shops = pd.concat([shop1, shop2], keys=["one", "two"])
    shops

              foo  bar
    one 2010   23   13
        2011   25   29
    two 2010  223  213
        2011  225  229

We want to swap the hierarchical indices. For this we will use 'swaplevel':

    # swaplevel returns a new DataFrame, so the result has to be assigned
    shops = shops.swaplevel()
    shops.sort_index(inplace=True)
    shops

              foo  bar
    2010 one   23   13
         two  223  213
    2011 one   25   29
         two  225  229
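One detail worth keeping in mind: swaplevel returns a new DataFrame and leaves the original untouched, so its result has to be assigned to a name to have any effect. A minimal sketch using the same shop data:

```python
import pandas as pd

shop1 = pd.DataFrame({"foo": {2010: 23, 2011: 25}, "bar": {2010: 13, 2011: 29}})
shop2 = pd.DataFrame({"foo": {2010: 223, 2011: 225}, "bar": {2010: 213, 2011: 229}})
shops = pd.concat([shop1, shop2], keys=["one", "two"])

# swaplevel returns a new DataFrame; shops itself keeps its level order.
swapped = shops.swaplevel()
# Sort by the new outer level (year) first, then by the shop key.
swapped = swapped.sort_index()

print(shops.index.tolist())
print(swapped.index.tolist())
```

After the swap, the year is the outer level and the shop key the inner one, which is the arrangement needed for the population data.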

We will go back to our initial problem with the population figures. We will apply the same steps to those DataFrames:

    pop_complete = pd.concat([population.T, male_pop.T, female_pop.T],
                             keys=["total", "male", "female"])
    df = pop_complete.swaplevel()
    df.sort_index(inplace=True)
    df[["Austria", "Australia", "France"]]
    Country        Austria   Australia      France
    2002 female  4179743.0   9887846.0  30510073.0
         male    3959567.0   9753133.0  28827658.0
         total   8139310.0  19640979.0  59337731.0
    2003 female  4158169.0   9999199.0  30655533.0
         male    3909120.0   9873447.0  28974588.0
         total   8067289.0  19872646.0  59630121.0
    2004 female  4190297.0  10100991.0  30789154.0
         male    3949825.0   9990513.0  29111526.0
         total   8140122.0  20091504.0  59900680.0
    2005 female  4220228.0  10218321.0  32147490.0
         male    3986296.0  10121438.0  30371081.0
         total   8206524.0  20339759.0  62518571.0
    2006 female  4246571.0  10348070.0  32390087.0
         male    4019354.0  10257418.0  30608686.0
         total   8265925.0  20605488.0  62998773.0
    2007 female  4261752.0  10570420.0  32587979.0
         male    4037171.0  10444622.0  30804161.0
         total   8298923.0  21015042.0  63392140.0
    2008 female  4277716.0  10770864.0  32770860.0
         male    4054214.0  10660917.0  30982280.0
         total   8331930.0  21431781.0  63753140.0
    2009 female  4287213.0  10986535.0  33208315.0
         male    4068047.0  10888385.0  31158647.0
         total   8355260.0  21874920.0  64366962.0
    2010 female  4296197.0  11218144.0  33384930.0
         male    4079093.0  11124254.0  31331380.0
         total   8375290.0  22342398.0  64716310.0
    2011 female  4308915.0  11359807.0  33598633.0
         male    4095337.0  11260747.0  31531113.0
         total   8404252.0  22620554.0  65129746.0
    2012 female  4324983.0  11402769.0  33723892.0
         male    4118035.0  11280804.0  31670391.0
         total   8443018.0  22683573.0  65394283.0
    df.to_csv("/data1/countries_total_population.csv")
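A file written from a DataFrame with a hierarchical index can be read back with both index levels intact. The following sketch uses the small shops example rather than the population data, and the level names "year" and "shop" are invented for this illustration:

```python
import io
import pandas as pd

shop1 = pd.DataFrame({"foo": {2010: 23, 2011: 25}, "bar": {2010: 13, 2011: 29}})
shop2 = pd.DataFrame({"foo": {2010: 223, 2011: 225}, "bar": {2010: 213, 2011: 229}})
shops = pd.concat([shop1, shop2], keys=["one", "two"]).swaplevel().sort_index()

# Naming the index levels gives the CSV a complete header line.
shops.index.names = ["year", "shop"]

csv_text = shops.to_csv()
print(csv_text)

# index_col=[0, 1] rebuilds the two-level index from the first two columns.
restored = pd.read_csv(io.StringIO(csv_text), index_col=[0, 1])
print(restored.loc[(2010, "two")])
```

Each index level becomes an ordinary column in the file, so the reader has to be told how many leading columns belong to the index.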

Live Python preparation

instructor-led training course

Upcoming online Courses

Information Analysis With Python

09 Mar 2022 to xi Mar 2022
18 May 2022 to twenty May 2022
31 Aug 2022 to 02 Sep 2022
nineteen Oct 2022 to 21 Oct 2022

Enrol here

Exercise 2

  • Read in the dsv file (csv) bundeslaender.txt. Create a new file with the columns 'land', 'area', 'female', 'male', 'population' and 'density' (inhabitants per square kilometre).
  • Print out the rows where the area is greater than 30000 and the population is greater than 10000
  • Print the rows where the density is greater than 300

    lands = pd.read_csv('/data1/bundeslaender.txt', sep=" ")
    print(lands.columns.values)

OUTPUT:

    ['land' 'area' 'male' 'female']

    # swap the columns of our DataFrame:
    lands = lands.reindex(columns=['land', 'area', 'female', 'male'])
    lands[:2]

                    land      area  female  male
    0  Baden-Württemberg  35751.65    5465  5271
    1             Bayern  70551.57    6366  6103

    lands.insert(loc=len(lands.columns),
                 column='population',
                 value=lands['female'] + lands['male'])
    lands[:3]

                    land      area  female  male  population
    0  Baden-Württemberg  35751.65    5465  5271       10736
    1             Bayern  70551.57    6366  6103       12469
    2             Berlin    891.85    1736  1660        3396

    lands.insert(loc=len(lands.columns),
                 column='density',
                 value=(lands['population'] * 1000 / lands['area']).round(0))
    lands[:4]

                    land      area  female  male  population  density
    0  Baden-Württemberg  35751.65    5465  5271       10736    300.0
    1             Bayern  70551.57    6366  6103       12469    177.0
    2             Berlin    891.85    1736  1660        3396   3808.0
    3        Brandenburg  29478.61    1293  1267        2560     87.0

    print(lands.loc[(lands.area > 30000) & (lands.population > 10000)])

OUTPUT:

                      land      area  female  male  population  density
    0    Baden-Württemberg  35751.65    5465  5271       10736    300.0
    1               Bayern  70551.57    6366  6103       12469    177.0
    9  Nordrhein-Westfalen  34085.29    9261  8797       18058    530.0
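The third task of the exercise, printing the rows with a density above 300, is not covered by the solution above. A possible sketch follows. It rebuilds only the first four rows shown earlier instead of reading bundeslaender.txt, so the result is limited to those rows:

```python
import pandas as pd

# The first four rows of the table shown above (population in thousands).
lands = pd.DataFrame({
    "land": ["Baden-Württemberg", "Bayern", "Berlin", "Brandenburg"],
    "area": [35751.65, 70551.57, 891.85, 29478.61],
    "female": [5465, 6366, 1736, 1293],
    "male": [5271, 6103, 1660, 1267],
})
lands["population"] = lands["female"] + lands["male"]
lands["density"] = (lands["population"] * 1000 / lands["area"]).round(0)

# Boolean indexing: keep only the rows whose density exceeds 300.
dense = lands.loc[lands["density"] > 300]
print(dense)
```

Because the density was rounded, Baden-Württemberg has a value of exactly 300.0 and is excluded by the strict comparison; only Berlin remains among these four rows.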

Reading and Writing Excel Files

It is also possible to read and write Microsoft Excel files. The Pandas functionalities to read and write Excel files use the modules 'xlrd' and 'openpyxl'. These modules are not automatically installed by Pandas, so you may have to install them manually!

We will use a simple Excel document to demonstrate the reading capabilities of Pandas. The document sales.xls contains two sheets, one called 'week1' and the other one 'week2'.
An Excel file can be read in with the Pandas function "read_excel". This is demonstrated in the following example Python code:

    excel_file = pd.ExcelFile("/data1/sales.xls")
    sheet = pd.read_excel(excel_file)
    sheet

         Weekday          Sales
    0     Monday  123432.980000
    1    Tuesday  122198.650200
    2  Wednesday  134418.515220
    3   Thursday  131730.144916
    4     Friday  128173.431003

The document "sales.xls" contains two sheets, but we have only been able to read in the first one with "read_excel". A complete Excel document, which can consist of an arbitrary number of sheets, can be completely read in like this:

    docu = {}
    for sheet_name in excel_file.sheet_names:
        docu[sheet_name] = excel_file.parse(sheet_name)

    for sheet_name in docu:
        print("\n" + sheet_name + ":\n", docu[sheet_name])

OUTPUT:

    week1:
          Weekday          Sales
    0     Monday  123432.980000
    1    Tuesday  122198.650200
    2  Wednesday  134418.515220
    3   Thursday  131730.144916
    4     Friday  128173.431003

    week2:
          Weekday          Sales
    0     Monday  223277.980000
    1    Tuesday  234441.879000
    2  Wednesday  246163.972950
    3   Thursday  241240.693491
    4     Friday  230143.621590

We will now calculate the average sales numbers of the two weeks:

    average = docu["week1"].copy()
    average["Sales"] = (docu["week1"]["Sales"] + docu["week2"]["Sales"]) / 2
    print(average)

OUTPUT:

         Weekday          Sales
    0     Monday  173355.480000
    1    Tuesday  178320.264600
    2  Wednesday  190291.244085
    3   Thursday  186485.419203
    4     Friday  179158.526297
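The same average can be computed for any number of sheets by placing them side by side and taking the row-wise mean. The sketch below is one possible way to do this, not the approach of the original text. It builds the two sheets in memory from the figures printed above, so no Excel file or Excel module is needed to run it:

```python
import pandas as pd

# In-memory stand-ins for the two sheets, using the figures printed above.
weekdays = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
docu = {
    "week1": pd.DataFrame({"Weekday": weekdays,
                           "Sales": [123432.98, 122198.6502, 134418.51522,
                                     131730.144916, 128173.431003]}),
    "week2": pd.DataFrame({"Weekday": weekdays,
                           "Sales": [223277.98, 234441.879, 246163.97295,
                                     241240.693491, 230143.62159]}),
}

# One column per sheet, aligned on the weekday ...
sales = pd.concat({name: sheet.set_index("Weekday")["Sales"]
                   for name, sheet in docu.items()}, axis=1)
# ... then the mean across the sheets for every weekday.
average = sales.mean(axis=1).rename("Sales")
print(average)
```

Aligning on the weekday instead of the row position means that the result stays correct even if the sheets list the days in a different order.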

We will save the DataFrame 'average' in a new document with 'week1' and 'week2' as additional sheets as well:

    writer = pd.ExcelWriter('/data1/sales_average.xlsx')
    docu['week1'].to_excel(writer, sheet_name='week1')
    docu['week2'].to_excel(writer, sheet_name='week2')
    average.to_excel(writer, sheet_name='average')
    # close() writes the file to disk; the older save() method was removed in Pandas 2.0
    writer.close()


Source: https://python-course.eu/numerical-programming/reading-and-writing-data-in-pandas.php
