Thursday 13 December 2012

Missing Values

You may have used the command MISSING VALUES in a file to declare a certain value(s) 
as missing, for example:

MISSING VALUES V1 (8,9)
 
Here the values 8,9 will be considered missing in the data and will not be included 
in computations. 
 
 But, sometimes you want those values back. One way would be to close the file and 
reload but that is cumbersome. The short way to do that is:
 
MISSING VALUES V1 ().

The above command will remove any previously declared values from missing category to the data category.

Thursday 16 August 2012

Data Cleaning (draft entry)

My next post will be about Data Cleaning. I am not the expert on this but I know few things. One simple way  to do this is to compare data entered by 2 different people. The command in SPSS is called

UPDATE FILE

Here is an example from UCLA site:

update file = "D:\person1.sav"
/in = flag1
/file = "D:\person2.sav"
/by all.
exe.
 
More valuable information in this pdf. 
http://www.ats.ucla.edu/stat/sas/library/nesug99/ss123.pdf
 

Need to update this blog!! 

Wednesday 15 August 2012

Effect Size


"Statistical significance only tells the researcher how likely it is that an observed finding could have occurred by chance. It does not say anything about magnitude of the effect observed. Effect size is a name given to a group of statistics that measure the magnitude of a treatment effect. In many cases, effect size is a better measure of research outcomes than the significance level. This is because with large samples, one can observe statistically significant group differences even when only a tiny effect is present. Unlike significance tests, effect size indices are independent of sample size." source: http://www.umdnj.edu/idsweb/shared/effect_size.htm

Effect size calculator

another calculator

another calculator

Tuesday 14 August 2012

Data/Software/information sources (free)

This is a loose compilation of sources of meta data/journals/software etc. related to population and health, concerning international issues in general but in particular the USA and Pakistan. I think this can be potentially very useful for graduate students of these two countries.

Asian Barometer "The Asian Barometer (ABS) is an applied research program on public opinion on political values, democracy, and governance around the region. The regional network encompasses research teams from 13 East Asian political systems (Japan, Mongolia, South Koreas, Taiwan, Hong Kong, China, the Philippines, Thailand, Vietnam, Cambodia, Singapore, Indonesia, and Malaysia), and 5 South Asian countries (India, Pakistan, Bangladesh, Sri Lanka, and Nepal)."


Databases/software (free) for social sciences and public health:
http://en.citizendium.org/wiki/Free_statistical_software#_note-idams

Current Population Survey
(CPS) Datasets for download (free; SAS format only)


Department of Health and Human Services
(HHS) Data Finder


The General Social Survey (GSS) contains a standard 'core' of demographic, behavioral, and attitudinal questions, plus topics of special interest. Many of the core questions have remained unchanged since 1972 to facilitate time-trend studies as well as replication of earlier findings. The GSS takes the pulse of America, and is a unique and valuable resource. It has tracked the opinions of Americans over the last four decades.

Download data (SPSS format) from here.

Univ of Michigan Database of data files
http://www.icpsr.umich.edu/icpsrweb/ICPSR/themes/index.jsp


Princeton university Dataset sources for Pakistan

The NICHD Data and Specimen Hub (DASH) is a centralized resource that allows researchers to share and access de-identified data from studies funded by NICHD. DASH also serves as a portal for requesting biospecimens from selected DASH studies.
 
 
The Data Online for Population, Health and Nutrition (DOLPHN) system is an online statistical data resource containing selected current and historical country-level demographic and health indicator data. The DOLPHN system is designed to provide users with quick and easy access to frequently used statistics and can be helpful as both a reference and analytical tool.

Stanford University Data sets (free) http://data.stanford.edu/

Interesting link for PhD students
http://www2.hud.ac.uk/research/gradcentre/links.php

UNICEF/WHO
sanitation and water


Open Source Publishing


Jstor Data
http://dfr.jstor.org/

Google and Wiley Interscience



Google public data visualization​The Google Public Data Explorer makes large datasets easy to explore, visualize and communicate.

Harvard data
related to public health
The purpose of this website is to provide public health professionals, researchers, policy makers and students with a comprehensive catalog of Maternal and Child Health (MCH) data sets, interactive tools and other resources.

CDC Wonder
Wide-ranging Online Data for Epidemiologic Research


CDC newborn feeding practices datasets
http://www.cdc.gov/ifps/data/index.htm

CDC datasets on breastfeeding practices:
http://www.cdc.gov/breastfeeding/data/index.htm

The Cochare Library (great for public health publications)
http://www.thecochranelibrary.com/view/0/index.html?gclid=CPWu8trNy6ACFdNA6wodqn_u0A

JHUCCP research tool database
http://new.jhuccp.org/research/researchDB/


Pew Research Center Databases
You can download the data collected by Pew Research Center from here for their various national and international surveys (the religion project includes Pakistan).


PRB Data Finder


Research Gate

Professional network for scientists.


RAND data


A UH student analysis on different meta-data sources:
http://www2.hawaii.edu/~jacso/extra/

UN population data


US Census, international population statistics


World Bank Datasets


World Bank Data


World Values Survey

http://www.ipums.org/ Integrated Public Use Microdata Series from Minnesota University

National Bureau of Economic Research data from diff. sources related to American Demographics and Economics

Wednesday 25 July 2012

To ADD (or SUM) in SPSS

Well, in SPSS you can add a series of variables in two different ways. First is you add two variables i.e., boys and girls and get the total children. Or, you want to create an Index based on a series of scores but want to ignore the respondent who missed out on any of the variables in the series (i.e., there is a MISSING value in 1 or more variables for them).
 
compute t_child=sons+daughters.
OR
compute t_child=sum (sons, daughters).

"The difference between the two procedures above is that in the first procedure, the case on total would be missing if any one of the four variables had missing values on a case; in the second procedure, the total would be computed while ignoring missing values on the four variables." No cases will be dropped due to a missing value in any of the variables. "Essentially SPSS treats the missing value as ZERO." 

In the SUM argument the variables must be separated by comma but if there are multiple variables you can use the option of TO to provide a range. For example, if you want to construct a happiness index based on 12 indicators/variables hap1 thru hap12, you can use the following syntax:

compute happiness=sum (hap1 thru hap12).

Source: Indiana University IT Services and others.

 
Another point to note is that  "the SUM() function is evidently flexible enough to respect more complex statements like SUM(Var1+Var2, Var3-Var4, Var5*Var6).  Hence, do not use the addition symbol when you use SUM unless that is part of the list of arguments. Source: SPSSX Discussion group

While talking about the flexibility and greatness of SUM, there is another neat function that you can take note of. So, in case you want to limit the CASE DROPPING based on any MISSING values, you can provide a number to TELL the computer to keep a CASE/RESPONDENT if at least X # of variables are answered. So, 

COMPUTE V3 = SUM.2(V1, V2). EXECUTE .

"The .2 appended to the end of the SUM function in the above example can be any integer. Use it to indicate the minimum number of valid cases necessary to perform a given calculation." Source: Indiana University IT Services

Also remember Listwise and pairwise deletion a concept SPSS uses while using addition function. According to a discussion group they are defined as:

Listwise - then if the respondent has any missing value for any variable then the respondent is omitted from all your data analysis.

Pairwise - not as harsh as listwise in that the respondent is dropped only on analyses involving variables that have missing values.


Also check the IBM site and Psychwiki for more on list and pairwise deletion.

Tuesday 24 July 2012

Factor Analysis in short (not my writing)

What is Factor Analysis?*
"Factor analysis is a form of exploratory multivariate analysis that is used to either reduce the number of variables in a model or to detect relationships among variables. All variables involved in the factor analysis need to be interval and are assumed to be normally distributed."

SPSS syntax:

factor
/variables read write math science socst
/criteria factors(2)
/extraction pc
/rotation varimax
/plot eigen.

Here is the syntax in SPSS from ANU course notes:

FACTOR
/VARIABLES q34_1 to q34_12
/MISSING LISTWISE /ANALYSIS q34_1 to q34_12
/PRINT INITIAL KMO REPR EXTRACTION ROTATION
/CRITERIA MINEIGEN(1) ITERATE(25)
/FORMAT SORT
/EXTRACTION PAF
/CRITERIA ITERATE(25)
/ROTATION VARIMAX
/METHOD=CORRELATION .

Crate SCALE using FA

FACTOR
/VARIABLES q34_1 to q34_12
/MISSING LISTWISE /ANALYSIS q34_1 to q34_12
/PRINT INITIAL KMO REPR EXTRACTION ROTATION
/CRITERIA MINEIGEN(1) ITERATE(25)
/FORMAT SORT
/EXTRACTION PAF
/PLOT EIGEN
/CRITERIA ITERATE(25)
/ROTATION VARIMAX
/SAVE REG (2)
/METHOD=CORRELATION .



*Introduction to SAS. UCLA: Academic Technology Services, Statistical Consulting Group. from http://www.ats.ucla.edu/stat/sas/notes2/ (accessed November 24, 2007).

Measuring unmet need for family planning.

Preamble
“Millions of women would prefer to avoid becoming pregnant either right away or ever, but are not using contraception. These women have an unmet need for family planning. Programs can serve many of these women by developing strategies that respond directly to their concern.” Ref: Population Reports, Sept 1996.

Unmet need is defined on the basis of women’s responses to survey questions and following are some of the definitions that have been used since 1970’s.

The KAP-Gap
Definition one: Women who wanted to have no more children but were not using contraception. (Ignored spacers, exposure to risk of pregnancy)

The world fertility survey (WFS 1972-1984)
Definition two: Same as above but excluded pregnant and amenorrheic women, because they did not currently need contraception. (Ignored spacers)

In 1981, John Anderson and Leo Morris measured the percentage of women of reproductive age who are “exposed to the risk of unintended pregnancy and are not using contraceptive”. (Included spacers). Next year Nortman and Gary developed a model by including pregnant, breast feeding, or amenorrheic in the definition of unmet need.

After ICPD 1994, Sinding and Fathalla have suggested to measure unmet need more broadly including unmet need among people who are using contraception but may be dissatisfied with their method. By using both qualitative and quantitative data, they suggest experience with sideeffects, discontinuation and other problems of contraception could help extend the focus of unmet need from use of any method to the quality of care.

Arguments over who is at risk, should we include inappropriate method use and method failure. DHS started asking questions on intentions about current pregnancy, therefore, including pregnant women. Recently included category is unmarried women. In short, include all women who are “at risk” of an unintended or mistimed pregnancy.

Considering the importance of measurement of unmet need, now all DHS and FP/RH Survey questionnaire ask about extended definition of unmet need.


Casterline (1997) pointed out that there can be inaccuracies in the reporting of contraceptive use and in the reporting of fertility preferences, and both pieces of information are required for estimating unmet need. Furthermore, his work shows that unmet need is subject to different definitions, and its measurement is not straightforward. Therefore, any survey undertaken for the measurement of unmet need must consider issues of definition in advance.


Following chart shows the standard formulation of unmet need.




Naming multiple variables at the same time, with syntax

Of course, it has to be with SYNTAX... I like to do everything with Syntax because of so many reasons but mostly to keep a log of what I am doing and secondly to reduce the key strokes I have to make for repeated jobs!!!


So, in case you have some variables for which you have to assign VALUE AND VARIABLE LABELS wouldn't it be handy to if you are able to do them with one command.I know it is a small thing and most people who use SPSS would laugh at me for even writing a blog entry on this, but believe me it is easy to forget little things especially if you go out of touch for a year or two. So, here is the command:

VAR LAB

q29a '(RHC)Number of hours facility open for consultantion'
/q29b 'Number of hours facility open for consultantion BHU'
/q29c 'Number of hours facility open for consultantion MCH center'
/q29d 'Number of hours facility open for consultantion Dispensary'
/q29e 'Number of hours facility open for consultantion govt hospital'
/q29f 'Number of hours facility open for consultantion Pvt hospital'
/q29g 'Number of hours facility open for consultantion Dispensary/Compoder'
/q29h 'Number of hours facility open for consultantion Nurse/LHV'
/q29i 'Number of hours facility open for consultantion Hakeem/ Homeopath'
/q29j 'Number of hours facility open for consultantion FWC'
/q29k 'Number of hours facility open for consultantion 'RHS-A'
/q29l '(Others)Number of hours facility open for consultantion'.


Please note in the above syntax, After VAR LAB the first variable name is written as is, but the rest precede with a backslash "/".

Data transfer (migration) from Access to SPSS

Problem:

Your data has been entered in MS Access, where all the variables (fields) have been defined names, width, type etc. and there are look up arrays/tables linked with each of the fields to describe the Response Values. But then you need to run some stats in SPSS. So, you basically EXPORT the file to some data analysis software like SPSS. One way to do it is export to MS Excel format and open/Import the Excel sheet into SPSS, which is pretty straightforward and simple. But then you examine the file and you will notice that in this transition, all the nifty labels of fields and values are gone and you will have to either make guesses or look at your data collection instruments to make sense of the numbers.

So, the choice you have is either to keep doing that or manually assign all the labels in SPSS. It is fine if you have only a handful of variables, but if you have a long list it is a lot of work!!!

What do you do? I have been trying to get around this problem for months now with no success. There is a Script on my favorite SPSS site:
http://www.spsstools.net/Scripts/ImportExport/ExportLabelsFromAccessToSPSS.txt

which should do the needful but I am certainly not doing it right. Need help. All the google search and various discussion groups have proven to be of no use also. Apparently I can create a link through ODBC but I am too lame to figure that out...

Any help?

Update 1: No luck with VB or Python or ODBC etc. because I am too dumb to learn them on my own! However, I learnt that if you have to do that a lot, there is a handy program that can do it for you. It is called Stat/Transfer. One caveat is that it is not FREE. The student version costs $59. In future I would buy it if I am stuck with multiple transfers between various databases. In the past they also used to have DBMS copy for such things but it does not exist anymore. I have tried to search for a Open Source version for Stat/Transfer but no luck yet!!

Thursday 26 January 2012

Entering quantitative data into the computer

Well there are many many ways you can enter the quantitative data (generated from surveys) into a computer. Most of the people I know use either use a simple Excel table or very very complex MS Access data entry file. However both of these programs are not created for data entry purposes and hence are not so handy for that purpose.

There are several software out there in the market which you can buy to do the job, or you can outsource the data for entry or create a program for the very purpose yourself. All of these options require extra resources. SPSS, the most popular data analysis program also has a data entry module which is great but that is not free.

The best option would be to have a data entry program which is made just for that purpose, is easy to learn and use and most importantly is free. Thankfully there are 2 such software:

1) CSPro
"The Census and Survey Processing System (CSPro) is a public domain software package used by hundreds of organizations and tens of thousands of individuals for entering, editing, tabulating, and disseminating census and survey data."

2) EpiData
"EpiData Entry is used for simple or programmed data entry and data documentation. Entry handles simple forms or related systems Optimised documentation."

In my view using using Excel and Access have their strength but for any survey with more than a page long questionnaire and more than 40-50 respondents, it is better to use one or the other of the above.

Range checks, value lengths, logical jumps, automatically calculated fields, data validation on double entry, find duplicates, assign values for missing or not applicable values etc. are some the things that needs to be done to maintain the integrity of data. Both of the above software can do that.

The files that are created in these programs can easily be exported to major analytical packages like SPSS and STATA.

Thursday 12 January 2012

Converting a table to Data File

I have been looking for this for so long.....

Thank you Mr. John Walkenbach


Here is a nifty Macro in Excel to do that!!!
Sub ReversePivotTable()
' Before running this, make sure you have a summary table with column headers.
' The output table will have three columns.
Dim SummaryTable As Range, OutputRange As Range
Dim OutRow As Long
Dim r As Long, c As Long
On Error Resume Next
Set SummaryTable = ActiveCell.CurrentRegion
If SummaryTable.Count = 1 Or SummaryTable.Rows.Count < 3 Then
MsgBox "Select a cell within the summary table.", vbCritical
Exit Sub
End If
SummaryTable.Select
Set OutputRange = Application.InputBox(prompt:="Select a cell for the 3-column output", Type:=8) '
Convert the range OutRow = 2 Application.ScreenUpdating = False
OutputRange.Range("A1:C3") = Array("Column1", "Column2", "Column3")
For r = 2 To SummaryTable.Rows.Count
For c = 2 To SummaryTable.Columns.Count
OutputRange.Cells(OutRow, 1) = SummaryTable.Cells(r, 1)
OutputRange.Cells(OutRow, 2) = SummaryTable.Cells(1, c)
OutputRange.Cells(OutRow, 3) = SummaryTable.Cells(r, c)
OutputRange.Cells(OutRow, 3).NumberFormat = SummaryTable.Cells(r, c).NumberFormat
OutRow = OutRow + 1
Next c
Next r
End Sub

VlookUp in Excel

Learnt a lot about VlookUp today. It is really one of the most powerful features of MS Excel.

Few things to remember.
The formula will return a "0" (without quotes) if the index variable was matched but there as not corresponding value to return. If the index variable does not match, Excel will return #NA.

-remember the column number you want to retrieve. This column number is counted from the column where you "looked" for the index variable.

-You can lookup in another file.
-Linked files get updated even if they are not open.
-To keep track of what changes has been made it is a good idea to use the "TRACK CHANGES" feature.
-Excel cares about trailing spaces but does not care for difference in CASE and the type of the cell.
-Cells cannot be protected, hence you have to be very careful when you use vlookup.

Example of VLookup with data in 2 different sheets.

=VLOOKUP(A:A,Sheet3!A:B,2,FALSE)

Another one,
=VLOOKUP(O2,B2:K41,5)

In this example O2 (the first condition) is the address of the value that you want to MATCH, for example registration number, ssn etc. (the index) with  the source data file. You can also provide a value here, but giving a cell address makes it easier to copy a range. The second term, B2:K41 is the range of source data in which the first column includes your INDEX variable. The third value, 5 is the column number after the index. For example if your INDEX is in column B and your VALUE (ie name) is in column E, you should write 4 here). Make sure to remember that you need to select the complete range of columns for the lookup. For example, if you have school names in column A,  freq of particular symptom in column B and you want to get the total number of health room visits from columns E (while D has the corresponding school name), you need to select both column D and E for the second condition (not just one column), then put 1 for the index number.

Comparing/Updating multiple files against a Master file

UPDATE FILE command can be used for that. More text to add later.

Meanwhile check this page.
http://www.ats.ucla.edu/stat/spss/faq/update.htm

Tuesday 10 January 2012

Excel oddities

Check this page for some oddities and quirks related to Excel.

http://spreadsheetpage.com/index.php/oddities

Thursday 5 January 2012

Working with a sub-group of data

Sometimes you want to include only a sub-set or sample of the data in your analysis. There are different ways of doing that.

1. Using a sample of the data
If you just want to randomly choose a percentage of the data to work on it, here is the formula to select 30% of the responses:
TEMPORARY. SAMPLE .30.
You can also select a number of cases, ie:
SAMPLE 50 from 100.
will choose 50 cases randomly from the first 100 cases.

Note: Adding the "TEMPORARY" before makes sure that your original data does not change.

2. Selected group of cases

You can use select if (by itself or with temporary) on the basis of a criteria to run your statistics. For example, if you want to see the sleep time for mothers with under 2 year old children, you may have to select Female in the sex field (if there is one) and the age of child. This command can also be used with SYSMIS (system missing values) also.

3. Filter

You can filter the data according to a criteria assigned in a variable ie work only with boys or girls sample or children who are in school or not in school. The only thing to remember is that the variable has to be a dummy variable (with 0 and 1 values). The filter "turns off" the zeros, ie if the variable gender has assigned 0 to boys and 1 to girls, when you use filter it will generate statistics only for girls. To do so,

FILTER BY variable name.


When you are done, don't forget to use:
FILTER OFF.

4. Split file

First of all sort the cases by the variable you want to split the file with.

SORT CASES BY variable name.
Now use the split command.

SPLIT FILE BY variable name.

Don't forget to do turn the split function off.

SPLIT FILE OFF.


(thanks to UCLA SPSS Learning Modules for their online guide)